source: arxiv machine learning: elmes*: automated construction of fine-grained evaluation rubrics for large language models in long-tail educational scenarios

level: research

evaluating large language models for education requires measuring teaching ability, not just factual correctness. current benchmarks focus on general knowledge or use manually created rubrics that do not scale to diverse, long-tail teaching situations. elmes* is a framework that automatically builds, refines, and applies detailed, scenario-specific rubrics. it uses a multi-agent system with teacher, student, and judge roles, plus a self-evolving module called scenegen that jointly improves evaluation criteria and test data based on expert-defined pedagogical dimensions.
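the teacher, student, and judge roles described above can be sketched roughly as follows. this is a minimal illustration, not the elmes* implementation: the function names (`run_dialogue`, `judge_transcript`), the `Indicator` class, the 0-5 score scale, and the stub agents are all assumptions made for this example, and real agents would be llm calls rather than lambdas.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Turn = Dict[str, str]  # {"role": ..., "text": ...}


@dataclass(frozen=True)
class Indicator:
    """Hypothetical second-level indicator under a first-level dimension."""
    name: str
    dimension: str


def run_dialogue(
    teacher: Callable[[List[Turn]], str],
    student: Callable[[List[Turn]], str],
    opening: str,
    n_turns: int,
) -> List[Turn]:
    """Alternate teacher and student replies, starting from a student question."""
    transcript: List[Turn] = [{"role": "student", "text": opening}]
    for _ in range(n_turns):
        transcript.append({"role": "teacher", "text": teacher(transcript)})
        transcript.append({"role": "student", "text": student(transcript)})
    return transcript


def judge_transcript(
    judge: Callable[[List[Turn], Indicator], float],
    transcript: List[Turn],
    rubric: List[Indicator],
) -> Dict[str, float]:
    """Score each indicator, then average the scores within each dimension."""
    per_dimension: Dict[str, List[float]] = {}
    for indicator in rubric:
        score = judge(transcript, indicator)
        per_dimension.setdefault(indicator.dimension, []).append(score)
    return {dim: sum(s) / len(s) for dim, s in per_dimension.items()}


# Stub agents standing in for LLM calls (assumption: scores lie on a 0-5 scale).
teacher_stub = lambda t: f"explanation after {len(t)} turns"
student_stub = lambda t: "can you give another example?"
judge_stub = lambda t, ind: 4.0 if ind.dimension == "accuracy" else 2.0

rubric = [
    Indicator("states the rule correctly", "accuracy"),
    Indicator("uses a correct worked example", "accuracy"),
    Indicator("offers a novel analogy", "creativity"),
]

transcript = run_dialogue(teacher_stub, student_stub, "why is 1/2 + 1/3 not 2/5?", n_turns=2)
scores = judge_transcript(judge_stub, transcript, rubric)
print(scores)
```

the judge returns one score per dimension instead of a single number, which is what makes scenario-specific, fine-grained comparison possible.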

using elmes*, the researchers built edu-330, a benchmark covering 330 scenarios across 11 subjects, 3 grade levels, and 10 task types, with over 1,000 second-level indicators. experiments on edu-330 and four expert-authored gold-standard scenarios reveal that educational capability has multiple dimensions: top llms differ most in creativity and values rather than in accuracy alone. a single score therefore cannot capture teaching quality, and fine-grained evaluation is needed to understand model strengths and weaknesses in real educational contexts.
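the arithmetic behind the benchmark size, and the reason a single score can hide differences, can be illustrated with a short sketch. two assumptions are made here: that the 330 scenarios form a full subject x grade x task grid (11 x 3 x 10 = 330 matches the reported count, but the source does not state this explicitly), and that the per-dimension scores for `model_a` and `model_b` are invented numbers for illustration only, not results from the paper.

```python
from itertools import product
from statistics import mean

# Placeholder labels; the actual subject, grade, and task names are not given in the source.
subjects = [f"subject_{i}" for i in range(1, 12)]      # 11 subjects
grades = [f"grade_level_{i}" for i in range(1, 4)]     # 3 grade levels
tasks = [f"task_type_{i}" for i in range(1, 11)]       # 10 task types

# Assumption: one scenario per (subject, grade, task) combination.
scenarios = list(product(subjects, grades, tasks))
print(len(scenarios))  # 330

# Invented per-dimension scores (0-5 scale) for two hypothetical models.
model_a = {"accuracy": 4.5, "creativity": 2.5, "values": 3.5}
model_b = {"accuracy": 3.5, "creativity": 4.5, "values": 2.5}

# Collapsing each profile to one number hides the difference entirely...
overall_a = mean(model_a.values())
overall_b = mean(model_b.values())

# ...while the per-dimension gaps show where the models actually differ.
gaps = {dim: model_a[dim] - model_b[dim] for dim in model_a}
print(overall_a, overall_b, gaps)
```

in this toy case both models average 3.5, yet they differ by 2 points on creativity, which is the kind of difference a per-dimension rubric is meant to surface.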

the framework addresses the challenge of evaluating llms in specialized, long-tail scenarios where manual rubric creation is too slow and expensive. by automating rubric generation and refinement, elmes* enables scalable, consistent assessment of how models explain concepts, adapt to student needs, and handle diverse subjects and age groups. this approach can help developers improve educational ai and give educators better tools for selecting and monitoring ai tutors.

why it matters: elmes* provides a scalable way to measure teaching quality in ai, which helps developers build better educational tools and check that models are safe and effective for diverse learners.

