level: technical
ExecuTorch extends PyTorch to run AI models locally on constrained edge devices. Arm has created practical Jupyter labs that walk through deployment on Arm CPUs and Ethos-U NPUs. The labs cover exporting models to the lightweight .pte format, lowering them for target hardware, and using backends such as XNNPACK for CPU acceleration. They also show how to delegate parts of a model to an NPU for faster inference.
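The export-and-lower step looks roughly like the sketch below, which follows ExecuTorch's documented XNNPACK flow. MobileNetV2 and the output file name are stand-ins, and import paths can shift between ExecuTorch releases.

```python
import torch
from torch.export import export
from torchvision.models import mobilenet_v2
from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner
from executorch.exir import to_edge_transform_and_lower

# Any torch.nn.Module works; MobileNetV2 stands in for the lab models here.
model = mobilenet_v2(weights="DEFAULT").eval()
example_inputs = (torch.randn(1, 3, 224, 224),)

# Export to an ATen graph, lower XNNPACK-supported ops to the CPU backend,
# and serialize the result as a lightweight .pte file.
exported = export(model, example_inputs)
et_program = to_edge_transform_and_lower(
    exported,
    partitioner=[XnnpackPartitioner()],
).to_executorch()

with open("mobilenet_v2_xnnpack.pte", "wb") as f:
    f.write(et_program.buffer)
```

The resulting .pte file is what the on-device ExecuTorch runtime loads; ops the partitioner claims run through XNNPACK, while everything else falls back to the portable CPU kernels.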
On a Raspberry Pi 5, ExecuTorch with XNNPACK reduced latency for an OPT-125M transformer compared with standard PyTorch, though sustained high load caused thermal throttling without active cooling. The labs stress that backend selection is critical: without XNNPACK, ExecuTorch can be slower than PyTorch. For NPU acceleration, models must be quantized to int8 and lowered to the TOSA intermediate representation. The Ethos-U partitioner then splits the graph, sending supported operators to the NPU and the rest to the CPU.
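The int8 requirement is met with PyTorch's PT2E quantization flow. The sketch below uses XNNPACKQuantizer purely to illustrate the prepare/calibrate/convert steps; the labs use an Arm-specific quantizer and the Ethos-U partitioner, whose exact import paths vary across ExecuTorch releases, so treat the names here as illustrative.

```python
import torch
from torch.ao.quantization.quantize_pt2e import convert_pt2e, prepare_pt2e
from torch.ao.quantization.quantizer.xnnpack_quantizer import (
    XNNPACKQuantizer,
    get_symmetric_quantization_config,
)
from torchvision.models import mobilenet_v2

model = mobilenet_v2(weights="DEFAULT").eval()
example_inputs = (torch.randn(1, 3, 224, 224),)

# Capture a pre-autograd graph for quantization (API name varies by
# PyTorch version; older releases use capture_pre_autograd_graph).
graph = torch.export.export_for_training(model, example_inputs).module()

# Insert observers, calibrate on representative data, then fold the
# observed ranges into int8 quantize/dequantize nodes.
quantizer = XNNPACKQuantizer().set_global(get_symmetric_quantization_config())
prepared = prepare_pt2e(graph, quantizer)
prepared(*example_inputs)  # calibration pass; use real data in practice
quantized = convert_pt2e(prepared)
```

The quantized graph is then exported and lowered as before, with the Ethos-U partitioner in place of the XNNPACK one, which converts claimed subgraphs to TOSA for the NPU compiler.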
The labs use Model Explorer with Arm-developed adapters to visualize how models are partitioned. This helps identify fragmentation when unsupported operators force frequent CPU-NPU switches; for example, adding an LRN layer to MobileNetV2 broke a single NPU subgraph into multiple segments (a sketch of that experiment follows below). The labs provide executable code so developers can experiment on their own hardware and understand the performance trade-offs.
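To reproduce the fragmentation experiment, one can splice an LRN layer into MobileNetV2 before lowering and then inspect the partitioned graph. The wrapper below is a hypothetical stand-in; the labs' exact insertion point may differ.

```python
import torch
import torch.nn as nn
from torchvision.models import mobilenet_v2

class MobileNetV2WithLRN(nn.Module):
    """MobileNetV2 with a Local Response Norm layer spliced in.

    LRN is not supported by the Ethos-U backend, so the partitioner
    must route it to the CPU, splitting what was one NPU subgraph
    into several NPU segments with CPU hops in between.
    """

    def __init__(self):
        super().__init__()
        base = mobilenet_v2(weights="DEFAULT")
        self.features = base.features
        self.lrn = nn.LocalResponseNorm(size=5)  # the unsupported op
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.classifier = base.classifier

    def forward(self, x):
        x = self.features(x)
        x = self.lrn(x)  # forces a CPU fallback mid-graph
        x = self.pool(x).flatten(1)
        return self.classifier(x)

model = MobileNetV2WithLRN().eval()
```

Lowering this variant and viewing the result in Model Explorer makes the extra CPU-NPU boundary visible as separate delegated segments.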
Why it matters: the labs give AI developers a practical way to optimize PyTorch models for low-power edge devices, reducing latency and improving privacy by keeping inference on-device.