Run Claude Code with local models

Source: KDnuggets: pairing Claude Code with local models

Level: technical

Agentic coding sessions with Claude Code can burn 10 to 50 times more tokens than a plain chat, so API costs add up fast. Rate limits can interrupt long workflows, and relying on a third-party API introduces pricing and availability risks. Local models in 2026 are good enough for tasks like code completion, refactoring, debugging, and codebase explanation. A well-chosen quantized model running locally covers most real use cases at zero per-token cost and with no rate limits.

Claude Code sends requests in the Anthropic Messages API format. Setting the ANTHROPIC_BASE_URL environment variable redirects those requests to any server that speaks the same format, including Ollama, LM Studio, and llama.cpp. You also need to set ANTHROPIC_API_KEY and ANTHROPIC_AUTH_TOKEN to placeholder strings, since a local server does not validate them, and map Claude Code's model tier requests (Sonnet, Haiku, Opus) to your local model name using ANTHROPIC_DEFAULT_SONNET_MODEL and similar variables. Ollama added native Anthropic API support in January 2026, LM Studio added a /v1/messages endpoint in version 0.4.1, and llama.cpp has had direct support for longer.

Ollama is the easiest starting point, handling model management and serving with simple commands. After installing Ollama and pulling a model like glm-4.7-flash, you set the environment variables and launch Claude Code. LM Studio offers a graphical interface for browsing models and starts a local server on port 1234. llama.cpp gives direct control over inference parameters and suits servers that need low overhead. All three backends work the same way: point ANTHROPIC_BASE_URL at the local server address and configure the model name mappings.
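
Because the three backends differ only in the server address, switching between them can be sketched as a small lookup. This is an illustrative sketch under assumptions: `BACKEND_URLS`, `backend_env`, and `launch` are hypothetical names, the Ollama (11434) and llama.cpp (8080) ports are those servers' usual defaults rather than values from the source, and the `claude` command is assumed to be on PATH. Only the SONNET mapping is set here to keep the example short.

```python
import os
import shutil
import subprocess

# Hypothetical lookup of local server addresses. Port 1234 comes from the
# text; 11434 and 8080 are assumed defaults for Ollama and llama.cpp.
BACKEND_URLS = {
    "ollama": "http://localhost:11434",
    "lmstudio": "http://localhost:1234",
    "llamacpp": "http://localhost:8080",
}


def backend_env(backend, model):
    """Return the Claude Code variables for one of the local backends."""
    return {
        "ANTHROPIC_BASE_URL": BACKEND_URLS[backend],
        "ANTHROPIC_API_KEY": "local",      # placeholder string
        "ANTHROPIC_AUTH_TOKEN": "local",   # placeholder string
        "ANTHROPIC_DEFAULT_SONNET_MODEL": model,
    }


def launch(backend, model):
    """Start Claude Code against the chosen backend and return its exit code."""
    claude = shutil.which("claude")
    if claude is None:
        raise FileNotFoundError("claude CLI not found on PATH")
    # Overlay the local settings on the current environment.
    return subprocess.call([claude], env={**os.environ, **backend_env(backend, model)})
```

With a server already running and a model loaded, a call such as `launch("ollama", "glm-4.7-flash")` would start a session against Ollama; changing the first argument to `"lmstudio"` or `"llamacpp"` switches the server without touching anything else.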

Why it matters: using local models with Claude Code eliminates per-token costs and rate limits, making agentic coding workflows more predictable and affordable for data scientists and developers.
