source: PyTorch blog: Running PyTorch models on Apple Silicon GPUs with the ExecuTorch MLX delegate

level: technical

The ExecuTorch MLX delegate is a new backend that compiles and runs PyTorch models on Apple Silicon GPUs through Apple's MLX framework. It integrates with the PyTorch 2 export stack, using torch.export for graph capture and TorchAO for quantization. The delegate handles graph partitioning and serialization ahead of time, then dispatches operations to Metal GPU kernels at runtime. It supports around 90 ATen ops, covering what transformer inference needs: quantized matmul, multi-head attention, and mixture-of-experts routing.
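
To make the idea of graph partitioning concrete, here is a toy sketch in plain Python. It is not the delegate's implementation: a real partitioner works on the exported graph, whereas this sketch treats the model as a flat list of op names. The `SUPPORTED_OPS` set, the `partition` helper, and the `my.custom_op` name are all hypothetical and chosen only for illustration.

```python
from itertools import groupby

# Hypothetical subset of supported ops. The delegate's real list
# (around 90 ATen ops) is not reproduced here.
SUPPORTED_OPS = {"aten.linear", "aten.mul", "aten.add", "aten.softmax"}


def partition(ops, supported=SUPPORTED_OPS):
    """Group consecutive ops into (is_delegated, [op names]) segments."""
    return [
        (is_supported, list(group))
        for is_supported, group in groupby(ops, key=lambda op: op in supported)
    ]


# "my.custom_op" is a made-up op that stands in for anything unsupported.
ops = ["aten.linear", "aten.softmax", "my.custom_op", "aten.mul", "aten.add"]
segments = partition(ops)

for delegated, segment in segments:
    target = "delegate" if delegated else "fallback"
    print(f"{target}: {segment}")
```

Each run of supported ops becomes one segment that could be handed to a backend, and anything else stays on a fallback path. Fewer, larger delegated segments generally mean fewer transitions between backends.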

The delegate achieves 3 to 6 times higher throughput on generative AI workloads than existing ExecuTorch delegates on macOS. It supports a range of precision and quantization options: BF16, FP16, and FP32, 2-, 4-, and 8-bit affine quantization via TorchAO, and NVFP4 quantization. A single quantized model definition can target multiple backends, and ExecuTorch's unified runtime API keeps applications portable across backends such as XNNPACK, Core ML, Vulkan, and CUDA.
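
As a rough illustration of what affine quantization means, the sketch below maps one group of float weights to unsigned integer codes using a scale and an offset, then maps them back. It is a generic textbook formulation in plain Python, not TorchAO's or the delegate's code. The helper names `quantize_group` and `dequantize_group` and the sample weights are assumptions made for this example.

```python
def quantize_group(values, bits=4):
    """Affine-quantize one group of floats to unsigned `bits`-bit codes.

    Returns (codes, scale, offset), where value ~= code * scale + offset.
    """
    qmax = (1 << bits) - 1           # largest code, e.g. 15 for 4 bits
    lo, hi = min(values), max(values)
    # Guard against a constant group, where hi == lo.
    scale = (hi - lo) / qmax if hi > lo else 1.0
    codes = [min(qmax, max(0, round((v - lo) / scale))) for v in values]
    return codes, scale, lo


def dequantize_group(codes, scale, offset):
    """Map integer codes back to approximate float values."""
    return [c * scale + offset for c in codes]


# Illustrative weights only; real groups typically hold many more values.
weights = [-0.8, -0.3, 0.1, 0.7]
codes, scale, offset = quantize_group(weights, bits=4)
restored = dequantize_group(codes, scale, offset)

for w, c, r in zip(weights, codes, restored):
    print(f"{w:+.3f} -> code {c:2d} -> {r:+.3f}")
```

With fewer bits the step size `scale` grows, so the reconstruction error per weight (at most about half a step) grows with it. That is the trade-off behind choosing between 2, 4, and 8 bits.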

Validated models include dense transformers such as Llama 3.2, Qwen 3, Phi-4 mini, and Gemma 3, as well as sparse mixture-of-experts models like Qwen 3.5 35B-A3B. Speech-to-text models, including OpenAI Whisper, NVIDIA Parakeet, and Mistral Voxtral, are supported for both offline and real-time streaming transcription. Over 30 additional models have been tested. The delegate is experimental and under active development, so its APIs are subject to change. Each model comes with a README and export instructions for getting started.

why it matters: The delegate brings faster, GPU-accelerated local inference to Apple Silicon Macs within the PyTorch ecosystem, simplifying deployment of large language and speech models.

