tuning local llm settings with ollama

source: kdnuggets: tweaking local language model settings with ollama

level: technical

ollama lets you run language models locally, but default settings often prioritize safe chat over performance. to get better results for tasks like coding or data extraction, you need to adjust model parameters. the modelfile works like a dockerfile, defining a base model, system instructions, and sampling parameters. you can set temperature, top-k, top-p, and min-p to control randomness and token selection. low temperature values make outputs more deterministic, while min-p dynamically filters tokens based on the top token's probability.

to prevent repetitive loops, use repetition, presence, and frequency penalties. stop sequences halt generation when the model produces unwanted text. managing context length is crucial because attention computation scales quadratically with token count. increasing num_ctx allows longer inputs but demands more vram. kv cache quantization, set via environment variables, compresses the key-value cache to save memory. options like q8_0 and q4_0 reduce vram usage with minimal quality loss.

server environment variables control the ollama daemon's behavior. you can bind the server to a network interface, change model storage location, and set how long models stay loaded in memory. enabling flash attention speeds up prompt processing on supported hardware. for prompt formatting, ollama uses go templates to convert chat histories into raw text with special tokens. this ensures the model correctly interprets system, user, and assistant roles.

why it matters: properly tuning ollama settings prevents silent failures like context truncation and repetitive outputs, making local models reliable for production tasks.

source: kdnuggets: tweaking local language model settings with ollama