level: technical
Over half the world's population speaks more than one language, and code-switching between languages mid-sentence is common. Yet voice agents often fail to handle it. A team built a benchmark to test automatic speech recognition (ASR) on code-switched speech in enterprise settings such as HR and IT support. The benchmark covers Spanish-English, French-English, Canadian French-English, and German-English pairs, using real-world scenarios such as benefits inquiries and password resets.
The team evaluated seven ASR systems using word error rate (WER), semantic WER (SWER), and answer error rate (AER). ElevenLabs Scribe v2, Google Gemini 3 Flash, and AssemblyAI Universal 3-Pro led in accuracy. Gemini 3 Flash excelled at preserving meaning despite slightly lower raw transcription scores. OpenAI Whisper Large v3 Turbo performed worst, often translating code-switched speech into English instead of transcribing it. The cost of code-switching varied by model and language pair, with the top models showing only a small penalty relative to monolingual baselines.
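For readers unfamiliar with the first metric, WER is conventionally the word-level edit distance (substitutions, deletions, insertions) between the reference and the hypothesis, divided by the number of reference words. The sketch below illustrates that standard definition only. The function name, the lowercase-and-split normalization, and the example utterance are assumptions made for illustration, not the benchmark's actual scoring pipeline, and SWER and AER are not reproduced here because the summary does not define them.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    # Assumed normalization: lowercase and split on whitespace.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: the system translates the English span
# into Spanish instead of transcribing it.
reference = "necesito hacer un password reset por favor"
hypothesis = "necesito hacer un restablecimiento de contraseña por favor"
score = word_error_rate(reference, hypothesis)
```

In this made-up example the translated output keeps the meaning but still incurs word-level errors, which is one reason a meaning-oriented metric can rank systems differently from raw WER.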
Analysis showed that the number of language switches in an utterance predicted whether an error occurred, while the code-mixing index (CMI) predicted how severe the error was. Surprisingly, errors concentrated on the English portions of utterances rather than on the matrix language. This suggests that even models strong in English struggle when English appears unexpectedly within another language context. The benchmark and data are publicly available to help improve multilingual voice agents.
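The two predictors can be computed from per-token language tags. The sketch below assumes a commonly used utterance-level formulation of CMI, 100 × (1 − max_i w_i / (n − u)), where n is the total number of tokens, u is the number of language-independent tokens, and w_i is the number of tokens in language i. The summary does not say which formulation the benchmark uses, so the formula choice, the function names, and the tagged example are all assumptions.

```python
from collections import Counter
from typing import Optional, Sequence


def switch_count(tags: Sequence[Optional[str]]) -> int:
    """Count language changes between consecutive language-tagged tokens.

    Tokens tagged None (e.g. names or numbers) are treated as
    language-independent and skipped.
    """
    langs = [t for t in tags if t is not None]
    return sum(1 for a, b in zip(langs, langs[1:]) if a != b)


def code_mixing_index(tags: Sequence[Optional[str]]) -> float:
    """Assumed formulation: 100 * (1 - max_i(w_i) / (n - u)), or 0 if n == u."""
    n = len(tags)
    langs = [t for t in tags if t is not None]
    u = n - len(langs)  # language-independent tokens
    if n == u:
        return 0.0
    dominant = max(Counter(langs).values())
    return 100.0 * (1.0 - dominant / (n - u))


# Hypothetical tagging of "necesito hacer un password reset por favor".
tags = ["es", "es", "es", "en", "en", "es", "es"]
switches = switch_count(tags)
cmi = code_mixing_index(tags)
```

Under this formulation the two measures capture different things: the switch count reflects how often the language changes, while CMI reflects how much of the utterance falls outside the dominant language, so an utterance can have many switches and still a moderate CMI.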
Why it matters: improving ASR for code-switched speech is critical for voice agents in customer service, where transcription errors can misroute tickets or cause requests to be misunderstood, directly affecting operational efficiency.