source: kdnuggets: best small language models on hugging face right now!
level: technical
a 4-billion-parameter model from early 2025 now beats models seven times larger on reasoning benchmarks. google's gemma 3 4b scores 89.2% on gsm8k math reasoning. microsoft's phi-4-mini at 3.8b hits 83.7% on arc-c, the highest in its size class. these scores used to require 30b+ models. small models under 7b parameters can run on a single consumer gpu, laptop, or smartphone, avoiding cloud costs and api limits.
three changes drove this progress. better training data focused on quality over quantity, like phi-4-mini trained on 5 trillion tokens of reasoning-dense synthetic and filtered data. distillation from frontier models, such as deepseek-r1-distill-qwen-1.5b learning step-by-step reasoning from a larger teacher. architectural improvements like mixture-of-experts, where gemma 3n e4b activates only 4b of its 8b parameters per token, and longer context windows up to 128k in sub-5b models.
top picks include qwen3.5-4b with a 262k token context window and apache 2.0 license, microsoft phi-4-mini-instruct at 3.8b with strong english reasoning and a 2.49 gb quantized file, and google gemma 3 4b it with 71.3% on humaneval and multimodal input. each model's code example shows how to load and run inference using transformers, with options for quantization and chat templates.
why it matters: small language models let you run capable ai locally on modest hardware, cutting costs and latency for tasks like reasoning, code generation, and document processing.
source: kdnuggets: best small language models on hugging face right now!