NeMo AutoModel speeds MoE fine-tuning 3.7x over Transformers v5

Source: Hugging Face blog: Accelerating Transformers fine-tuning with NVIDIA NeMo AutoModel

Level: technical

NVIDIA NeMo AutoModel is an open library that subclasses Hugging Face's AutoModelForCausalLM, keeping the same API while adding expert parallelism, DeepEP fused all-to-all dispatch, and TransformerEngine kernels. It uses Transformers v5's dynamic weight loading to support many model families without per-model checkpoint changes. The result is faster fine-tuning for mixture-of-experts (MoE) models such as Qwen3-30B-A3B and Nemotron 3 Nano 30B A3B.

On a single node with 8 H100 GPUs, NeMo AutoModel with expert parallelism of 8 achieved 11,340 tokens per second per GPU for Qwen3-30B-A3B, a 3.69x speedup over Transformers v5's 3,075. Peak memory dropped from 68.2 GiB to 48.1 GiB. For Nemotron 3 Nano 30B A3B, throughput reached 15,421 tokens per second, 3.36x faster than v5, with memory falling from 62.1 GiB to 42.5 GiB. Transformers v4 deadlocked on Qwen3 due to mismatched FSDP collectives from per-expert modules.
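The ratios quoted above can be rechecked with a few lines of arithmetic. The sketch below uses only the figures from the paragraph; the function names are illustrative, and the Transformers v5 baseline for Nemotron is not stated in the source, so the value computed here is only what the rounded 3.36x figure implies.

```python
# Figures quoted in the text (single node, 8 H100 GPUs).
QWEN3_AUTOMODEL_TPS = 11_340    # tokens/s per GPU, NeMo AutoModel, EP=8
QWEN3_V5_TPS = 3_075            # tokens/s per GPU, Transformers v5
NEMOTRON_AUTOMODEL_TPS = 15_421
NEMOTRON_SPEEDUP = 3.36         # as quoted; the v5 baseline itself is not given


def speedup(new_tps: float, old_tps: float) -> float:
    """Throughput ratio of the new run over the baseline."""
    return new_tps / old_tps


def memory_saving_pct(before_gib: float, after_gib: float) -> float:
    """Relative drop in peak memory, in percent."""
    return 100.0 * (before_gib - after_gib) / before_gib


qwen3_speedup = speedup(QWEN3_AUTOMODEL_TPS, QWEN3_V5_TPS)
qwen3_mem_saving = memory_saving_pct(68.2, 48.1)
nemotron_mem_saving = memory_saving_pct(62.1, 42.5)

# Derived, not reported: the baseline implied by the rounded speedup.
nemotron_v5_implied_tps = NEMOTRON_AUTOMODEL_TPS / NEMOTRON_SPEEDUP

print(f"Qwen3 speedup: {qwen3_speedup:.2f}x")
print(f"Qwen3 peak-memory saving: {qwen3_mem_saving:.1f}%")
print(f"Nemotron peak-memory saving: {nemotron_mem_saving:.1f}%")
print(f"Implied Nemotron v5 throughput: ~{nemotron_v5_implied_tps:.0f} tokens/s")
```

In both cases the memory saving works out to roughly 30 percent of the Transformers v5 peak.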

The speedup comes from three sources: expert parallelism shards expert weights across GPUs, cutting memory use; DeepEP fuses token dispatch and combine into optimized kernels that overlap communication with computation; and TransformerEngine kernels accelerate attention, linear layers, and normalization. For a full fine-tune of the 550B Nemotron 3 Ultra across 16 nodes, NeMo AutoModel ran where Transformers v5 ran out of memory, thanks to expert parallelism sharding experts across 64 GPUs.
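To make the expert-parallel idea concrete, here is a toy, single-process sketch of sharding and of the dispatch and combine steps. It is not NeMo AutoModel or DeepEP code: the expert count of 128, the contiguous block layout, and every function name are assumptions chosen for illustration, and real implementations move activations between GPUs with fused all-to-all kernels rather than Python dictionaries.

```python
from collections import defaultdict

NUM_EXPERTS = 128  # assumption for illustration; not stated in the source
EP_SIZE = 8        # expert-parallel degree used in the single-node benchmark


def owner_rank(expert_id: int, num_experts: int = NUM_EXPERTS,
               ep_size: int = EP_SIZE) -> int:
    """Rank that holds an expert under contiguous block sharding."""
    assert num_experts % ep_size == 0, "experts must divide evenly across ranks"
    experts_per_rank = num_experts // ep_size
    return expert_id // experts_per_rank


def dispatch(routing: dict[int, list[int]]) -> dict[int, list[tuple[int, int]]]:
    """Group (token, expert) pairs by the rank that owns each expert."""
    buckets: dict[int, list[tuple[int, int]]] = defaultdict(list)
    for token_id, expert_ids in routing.items():
        for expert_id in expert_ids:
            buckets[owner_rank(expert_id)].append((token_id, expert_id))
    return dict(buckets)


def combine(buckets: dict[int, list[tuple[int, int]]]) -> dict[int, list[int]]:
    """Gather per-rank results back into per-token lists (expert outputs omitted)."""
    per_token: dict[int, list[int]] = defaultdict(list)
    for pairs in buckets.values():
        for token_id, expert_id in pairs:
            per_token[token_id].append(expert_id)
    return {token_id: sorted(ids) for token_id, ids in per_token.items()}


# Hypothetical router output: each token is sent to two experts.
routing = {0: [3, 17], 1: [120, 16], 2: [5, 127]}
buckets = dispatch(routing)
restored = combine(buckets)
print(buckets)
print(restored)
```

Each rank stores only NUM_EXPERTS / EP_SIZE experts, which is where the memory saving comes from; the cost is the extra communication in dispatch and combine, the part that DeepEP is described as fusing and overlapping with computation.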

Why it matters: data scientists can fine-tune large mixture-of-experts models faster and with less memory using a single import change, making multi-GPU training more accessible.
