source: hugging face blog: which tokens does a hybrid model predict better?
level: research
researchers compared a 7b hybrid model, olmo hybrid, with a matched transformer, olmo 3, to see which tokens each predicts better. the models shared the same data, tokenizer, and training recipe, so differences in loss can be attributed to architecture. the hybrid mixes attention and recurrent layers, while the transformer relies solely on attention. by measuring the loss gap (the difference in prediction loss between the two models on the same token), they found that the hybrid generally outperforms the transformer, but that the size of its advantage varies by token type.
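a minimal sketch of the loss-gap measurement, assuming each model has already been scored on the same token sequence and that the gap is taken as transformer loss minus hybrid loss, so a positive value favors the hybrid. the sign convention, the function name `loss_gap`, and the numbers are illustrative assumptions, not values from the blog.

```python
from statistics import mean


def loss_gap(transformer_losses, hybrid_losses):
    """per-token gap; positive means the hybrid predicted that token better."""
    if len(transformer_losses) != len(hybrid_losses):
        raise ValueError("both models must be scored on the same tokens")
    return [t - h for t, h in zip(transformer_losses, hybrid_losses)]


# made-up per-token cross-entropy values, for illustration only
transformer_losses = [2.0, 0.5, 3.0, 1.0]
hybrid_losses = [1.5, 0.5, 2.5, 1.0]

gaps = loss_gap(transformer_losses, hybrid_losses)
mean_gap = mean(gaps)  # average advantage of the hybrid over these tokens
```

because the two models share a tokenizer, the losses line up position by position, which is what makes a per-token subtraction meaningful.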
the hybrid's edge is largest on content words like nouns, verbs, and adjectives, with a loss gap around 0.04, compared to about 0.02 for function words such as 'the' and 'of'. it also does well on tokens requiring state tracking, like pronoun reference. however, the hybrid's lead nearly vanishes on tokens that repeat earlier text verbatim, where the transformer's attention can directly copy. similarly, for closing braces in code or markup, the transformer matches the hybrid, as attention handles bracket matching easily.
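one way to picture the per-category comparison is to bucket tokens and average the loss gap inside each bucket. the sketch below is a rough stand-in, not the blog's method: it treats a token as "repeated" when the 3-gram ending at it has already appeared earlier in the text, uses a tiny hand-written function-word list, and calls every other alphabetic token "content". a real analysis would more likely use a part-of-speech tagger. all names and numbers here are hypothetical.

```python
from collections import defaultdict

# toy function-word list; an assumption for illustration
FUNCTION_WORDS = {"the", "of", "a", "an", "and", "to", "in", "on"}


def is_repeat(tokens, i, n=3):
    """true if the n-gram ending at position i already occurred earlier."""
    if i + 1 < n:
        return False
    target = tuple(tokens[i - n + 1 : i + 1])
    for j in range(n - 1, i):
        if tuple(tokens[j - n + 1 : j + 1]) == target:
            return True
    return False


def categorize(tokens, i):
    """assign one coarse bucket to the token at position i."""
    token = tokens[i]
    if is_repeat(tokens, i):
        return "repeated"
    if not token.isalpha():
        return "other"
    if token.lower() in FUNCTION_WORDS:
        return "function"
    return "content"


def mean_gap_by_category(tokens, gaps):
    """average the per-token loss gap within each bucket."""
    buckets = defaultdict(list)
    for i, gap in enumerate(gaps):
        buckets[categorize(tokens, i)].append(gap)
    return {name: sum(vals) / len(vals) for name, vals in buckets.items()}


tokens = "the cat sat on the mat . the cat sat".split()
# made-up gaps (transformer loss minus hybrid loss), one per token
gaps = [0.25, 0.5, 0.5, 0.25, 0.25, 0.5, 0.0, 0.25, 0.5, 0.0]
by_category = mean_gap_by_category(tokens, gaps)
```

in this toy input only the final "sat" counts as repeated, since it completes a second copy of "the cat sat"; that is the kind of token where attention can copy directly and the gap is expected to shrink.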
the team also tested filtered losses on 1b models—transformer, hybrid, and pure recurrent—during pretraining. on meaning-bearing non-repeated tokens, the hybrid and recurrent models beat the transformer, with the hybrid best. on repeated tokens, the pure recurrent model lagged, lacking attention to retrieve copies. these fine-grained losses reveal architecture strengths early in training, suggesting that evaluating specific token types can guide better hybrid design.
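a filtered loss can be read as an ordinary mean loss restricted to the tokens a filter keeps. the sketch below assumes a boolean mask marking meaning-bearing, non-repeated tokens and compares three hypothetical models at a single checkpoint; the mask, the model names as dictionary keys, and the loss values are invented for illustration.

```python
def filtered_loss(losses, keep):
    """mean loss over only the positions where keep is true."""
    selected = [loss for loss, k in zip(losses, keep) if k]
    if not selected:
        raise ValueError("filter selected no tokens")
    return sum(selected) / len(selected)


# hypothetical filter: true for meaning-bearing, non-repeated tokens
keep = [True, False, True, True]

# made-up per-token losses for three architectures at one checkpoint
per_model_losses = {
    "transformer": [2.0, 0.5, 3.0, 1.0],
    "hybrid": [1.5, 0.5, 2.5, 0.5],
    "recurrent": [1.75, 1.0, 2.75, 0.75],
}

scores = {name: filtered_loss(losses, keep) for name, losses in per_model_losses.items()}
best = min(scores, key=scores.get)  # architecture with the lowest filtered loss
```

computing the same filtered score at successive checkpoints would give one curve per architecture, which is how such a metric could surface differences early in training.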
why it matters: understanding token-level strengths helps build better hybrid models by combining attention for copying and recurrence for tracking meaning.