source: Hugging Face blog: Introducing Mellum2: A 12B Mixture-of-Experts Model by JetBrains
level: technical
Mellum2 is a 12B-parameter mixture-of-experts (MoE) model trained from scratch on natural language and code. It activates only 2.5B parameters per token, which makes it efficient for high-throughput, low-latency inference. The model is released under the Apache 2.0 license and is designed for routing, RAG, summarization, sub-agents, and private deployments. Compared with similarly sized models, Mellum2 delivers competitive benchmark performance while running inference more than 2x faster.
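To make the 12B-total versus 2.5B-active distinction concrete, the short calculation below works out the share of parameters used per token. The two parameter counts come from the announcement; the "about 2 FLOPs per used parameter per token" figure is a common rule-of-thumb assumption, not something the source states, and the function names are illustrative.

```python
# Parameter counts from the announcement (in billions).
TOTAL_PARAMS_B = 12.0
ACTIVE_PARAMS_B = 2.5


def active_fraction(active_b: float, total_b: float) -> float:
    """Share of the model's parameters used for a single token."""
    return active_b / total_b


def approx_forward_gflops_per_token(params_b: float) -> float:
    """Assumption: roughly 2 FLOPs per used parameter per token (rule of thumb)."""
    return 2.0 * params_b  # billions of parameters * 2 = GFLOPs


fraction = active_fraction(ACTIVE_PARAMS_B, TOTAL_PARAMS_B)
sparse_gflops = approx_forward_gflops_per_token(ACTIVE_PARAMS_B)
dense_gflops = approx_forward_gflops_per_token(TOTAL_PARAMS_B)

print(f"active fraction: {fraction:.1%}")  # active fraction: 20.8%
print(f"approx. GFLOPs per token: {sparse_gflops:.0f} (sparse) "
      f"vs {dense_gflops:.0f} (hypothetical dense 12B)")
```

This compute ratio is only a back-of-the-envelope estimate and is not the measured speedup. The "more than 2x" figure in the announcement is a comparison against other models, and real latency also depends on factors such as memory bandwidth and batching.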
The model targets latency-sensitive operations in modern AI systems that chain multiple model calls, such as routing, retrieval, summarization, and tool use. Benchmark evaluations show Mellum2 is competitive with similarly sized open models across code generation, reasoning, science, and math tasks. Its mixture-of-experts architecture keeps total capacity high while activating only a subset of parameters per token, which reduces serving costs for real-time workloads. Mellum2 focuses on text and code rather than multimodal tasks, which keeps it compact and efficient for software engineering.
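The idea of activating only a subset of parameters per token can be illustrated with a toy top-k MoE layer. This is a generic sketch of the technique, not Mellum2's actual routing: the source does not give the number of experts, the value of k, or the gating scheme, so `NUM_EXPERTS`, `TOP_K`, `DIM`, and the scaling "experts" below are all hypothetical.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def moe_layer(token, gate_weights, experts, top_k):
    """Route one token vector through its top_k highest-scoring experts."""
    # Gate: one score per expert (dot product of the token with that expert's gate vector).
    scores = [sum(w * x for w, x in zip(gate_vec, token)) for gate_vec in gate_weights]
    probs = softmax(scores)
    # Keep only the top_k experts; the remaining experts are never evaluated for this token.
    chosen = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)
    output = [0.0] * len(token)
    for i in chosen:
        expert_out = experts[i](token)
        weight = probs[i] / norm  # renormalize gate weights over the chosen experts
        output = [o + weight * e for o, e in zip(output, expert_out)]
    return output, chosen


def make_expert(scale):
    """Toy expert: scales its input. Real experts are feed-forward networks."""
    return lambda vec: [scale * x for x in vec]


random.seed(0)
NUM_EXPERTS, TOP_K, DIM = 8, 2, 4  # hypothetical sizes, chosen for illustration only
experts = [make_expert(i + 1) for i in range(NUM_EXPERTS)]
gate_weights = [[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(NUM_EXPERTS)]

token = [0.5, -1.0, 0.25, 2.0]
output, chosen = moe_layer(token, gate_weights, experts, TOP_K)
print(f"experts evaluated: {sorted(chosen)} ({len(chosen)} of {NUM_EXPERTS})")
```

All eight experts count toward the layer's total capacity, but each token pays the compute cost of only two of them, which is the trade-off the paragraph above describes.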
Key use cases include routing and orchestration in multi-model systems, RAG pipelines (context compression and retrieval post-processing), sub-agent tasks such as planning and validation, and private deployment in self-hosted environments. The model is positioned as a fast, well-scoped component for high-frequency tasks inside larger AI systems: it aims to make stacks faster, cheaper, and easier to control without replacing every model.
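The routing and orchestration use case can be sketched as a small dispatcher in which a fast model labels each request and a handler is chosen from that label. Everything here is hypothetical: `classify_with_fast_model` is a keyword-based stand-in for a real call to a small model, the route labels and handlers are invented for illustration, and none of this is a Mellum2 API.

```python
from typing import Callable, Dict


def classify_with_fast_model(request: str) -> str:
    """Hypothetical stand-in for a call to a small, fast model that returns a route label."""
    text = request.lower()
    if "summarize" in text or "summary" in text:
        return "summarize"
    if "traceback" in text or "error" in text:
        return "debug"
    return "general"


def handle_summarize(request: str) -> str:
    return f"[fast model] summary for: {request}"


def handle_debug(request: str) -> str:
    return f"[larger model] debugging: {request}"


def handle_general(request: str) -> str:
    return f"[fast model] general answer for: {request}"


# Route label -> handler. The labels are illustrative, not taken from the source.
HANDLERS: Dict[str, Callable[[str], str]] = {
    "summarize": handle_summarize,
    "debug": handle_debug,
    "general": handle_general,
}


def route(request: str, classifier: Callable[[str], str] = classify_with_fast_model) -> str:
    """Classify the request cheaply, then dispatch; unknown labels fall back to the general handler."""
    label = classifier(request)
    handler = HANDLERS.get(label, handle_general)
    return handler(request)


print(route("Please summarize this diff"))
print(route("I got an error in main.py"))
```

The classification step runs on every request, so it is the kind of high-frequency call where a low-latency model matters most, while only the requests that need it are sent to a larger model.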
Why it matters: Mellum2 gives developers a practical, open-source option when they need a fast, efficient language model for software engineering workflows, reducing latency and cost in production AI systems.