source: hugging face blog: how to fine-tune nemotron 3.5 asr for your language, domain, or accent
level: technical
nvidia released nemotron 3.5 asr, a 600m-parameter streaming speech-to-text model that handles 40 language-locales from a single checkpoint. it uses a cache-aware fastconformer-rnnt architecture to deliver real-time transcription with low latency and built-in punctuation and capitalization. the model is available as open weights on hugging face, allowing inspection, fine-tuning, and deployment without api dependencies.
the post demonstrates fine-tuning on greek and bulgarian, two mid-resource languages. using about 290 hours of public speech data, a full fine-tune reduced word error rate by 32% for greek and 31% for bulgarian on held-out test sets in the most demanding streaming mode. adding more data further improved results, though gains varied by language and domain. the workflow involves preparing tarred audio with correct language tags, training from the base checkpoint, and evaluating at deployment latency.
key lessons include evaluating on held-out data at the target latency, using accurate language tags for prompt conditioning, and protecting other languages by mixing in a small amount of their data during fine-tuning. the fine-tuned model uses the same architecture and can be deployed with the same inference options, including adjustable attention context size for latency-accuracy tradeoffs. a companion github repo provides scripts and configs for the full process.
why it matters: fine-tuning lets developers adapt a single multilingual asr model to specific languages, accents, or domains, improving accuracy for voice agents, live captions, and call-center analytics without managing multiple models.
source: hugging face blog: how to fine-tune nemotron 3.5 asr for your language, domain, or accent