Six months of LLM progress in five minutes

Source: Simon Willison: The last six months in LLMs in five minutes

Level: technical

The last six months saw a critical shift in large language models, especially for coding. November 2025 marked an inflection point: the title of best model changed hands five times among OpenAI, Anthropic, and Google. Claude Sonnet 4.5 started the month as the best, but GPT-5.1, Gemini 3, and Claude Opus 4.5 all briefly held the crown. Practitioners generally agreed that Opus 4.5 led for the next couple of months. The pelican riding a bicycle test, a whimsical benchmark that asks a model to draw a pelican riding a bicycle, made the improvements from one release to the next easy to see.

The real breakthrough was coding agents becoming reliably useful. After months of training with reinforcement learning from verifiable rewards, agents like OpenAI Codex and Anthropic Claude Code crossed a quality threshold. They went from often-working to mostly-working, which made them usable every day for real tasks without constant error correction. During the December to January holiday break, many developers experimented with these tools, leading to a surge of ambitious projects. Some, like micro-javascript, an implementation of JavaScript written in Python, were technically impressive but impractical.
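
To make "verifiable rewards" concrete: the idea is that a model's output is scored by a check that can be run automatically, such as a test suite, rather than by human judgment. The sketch below is only an illustration under that assumption; the function name verifiable_reward and the all-or-nothing 1.0/0.0 scoring are hypothetical and are not taken from the talk or from any lab's training pipeline. A real system would also run candidate code in a sandbox instead of calling exec directly.

```python
def verifiable_reward(candidate_source: str, test_source: str) -> float:
    """Score a candidate solution: 1.0 if every assertion passes, else 0.0.

    Hypothetical illustration only. exec() on untrusted code is unsafe;
    a real pipeline would isolate execution in a sandbox with time limits.
    """
    namespace: dict = {}
    try:
        # Define the candidate's functions, then run the assertions against them.
        exec(candidate_source, namespace)
        exec(test_source, namespace)
    except Exception:
        # Syntax errors, runtime errors, and failed assertions all earn no reward.
        return 0.0
    return 1.0


tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0"
good = "def add(a, b):\n    return a + b"
bad = "def add(a, b):\n    return a - b"

print(verifiable_reward(good, tests))  # 1.0
print(verifiable_reward(bad, tests))   # 0.0
```

Because the score comes from running the checks, it can be computed for very large numbers of attempts without a human in the loop, which is what makes this kind of signal practical for reinforcement learning on coding tasks.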

In February, the Warelay project, renamed OpenClaw, gained massive attention as a personal AI assistant. These "claws", likened to digital pets or Doc Ock's AI arms, became so popular that Mac minis sold out in Silicon Valley. Also in February, Gemini 3.1 Pro produced a notably good pelican image. Recent months brought open-weight advances: Google's Gemma 4 series and Chinese lab GLM's massive GLM-5.1 model. Qwen3.6-35B-A3B, a 20.9 GB model, ran on a laptop and drew a better pelican than Claude Opus 4.7, showing how far local models have come.
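
A rough back-of-the-envelope calculation shows why a model of this size fits on a laptop. The sketch below assumes that the "35B" in the model name means roughly 35 billion parameters, that 20.9 GB is measured in decimal gigabytes, and that the file consists almost entirely of weights. None of these assumptions is stated in the source, so the result is an estimate rather than a published figure.

```python
def bits_per_weight(file_size_gb: float, params_billions: float) -> float:
    """Average storage per parameter, given file size in decimal GB."""
    total_bits = file_size_gb * 1e9 * 8
    return total_bits / (params_billions * 1e9)


def size_gb_at(bits: float, params_billions: float) -> float:
    """File size in decimal GB if every parameter used the given bit width."""
    return params_billions * 1e9 * bits / 8 / 1e9


# Assumed inputs: 20.9 GB file, about 35 billion parameters.
print(f"{bits_per_weight(20.9, 35):.2f} bits per weight")  # 4.78 bits per weight
print(f"{size_gb_at(16, 35):.1f} GB at 16 bits")           # 70.0 GB at 16 bits
```

Under these assumptions the file averages a little under 5 bits per weight, compared with about 70 GB for the same parameter count at 16 bits, which is consistent with the weights being stored in a quantized format.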

Why it matters: coding agents have become practical daily tools, and capable open-weight models now run on consumer hardware, lowering barriers for AI development.
