soro: a lightweight tajik language model and chatbot

source: arxiv artificial intelligence: soro: a lightweight foundation model and chatbot for tajik

level: research

soro is a set of conversational large language models made for the tajik language. the models are built from open-weight gemma 3 checkpoints. they go through tajik-only continual pretraining on a 1.9-billion-token corpus. this corpus includes filtered web text, pdf documents, and educational materials aligned with the curriculum. after pretraining, the models are fine-tuned with supervised instruction tuning on 40,000 tajik teacher-style examples. the goal is to work well in tajikistan where computing power and internet connectivity are limited.

evaluating tajik language models is hard because standard benchmarks do not cover the language well. to fix this, the team created a new set of tajik benchmarks. these tests measure general knowledge, linguistic skills, and performance on school and university entrance exams. the benchmarks are open-sourced on hugging face. soro models do much better on these tajik tests than the original gemma 3 models of the same size. they also keep strong english performance on standard datasets.

the models are designed for real-world use with tight compute and connectivity limits. the paper also shows that soro can be compressed using fp8 and int4 quantization. this makes the models smaller and faster without losing much quality. the work provides a practical path for deploying capable language models in under-resourced languages and regions. the open-source benchmarks and models aim to support further research and development for tajik nlp.

why it matters: this shows how to adapt large language models for low-resource languages and deploy them in constrained environments, offering a template for other underrepresented languages.

source: arxiv artificial intelligence: soro: a lightweight foundation model and chatbot for tajik