source: hugging face blog: reachy mini goes fully local
level: technical
the reachy mini robot can now run its entire voice pipeline locally, removing the need for cloud servers or api keys. the system uses a cascaded approach: voice activity detection, then speech-to-text, then a large language model, then text-to-speech. users install a speech-to-speech library and a local llm server like llama.cpp, then point the robot's conversation app to the local backend. the default components are silero vad for voice activity detection, parakeet-tdt 0.6b v3 for stt, and qwen3-tts for tts, but every part is swappable.
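the cascade described above can be sketched as four swappable callables. this is a minimal illustration, not the speech-to-speech library's actual api: the `CascadedPipeline` class and the toy stages are hypothetical stand-ins for the real vad, stt, llm, and tts models.

```python
from dataclasses import dataclass, replace
from typing import Callable, List


@dataclass(frozen=True)
class CascadedPipeline:
    """Hypothetical container for the four stages; each one is swappable."""
    vad: Callable[[bytes], bool]   # True if the audio chunk contains speech
    stt: Callable[[bytes], str]    # speech audio -> transcript
    llm: Callable[[str], str]      # transcript -> reply text
    tts: Callable[[str], bytes]    # reply text -> reply audio

    def run(self, chunks: List[bytes]) -> bytes:
        # Keep only the chunks the VAD marks as speech.
        speech = b"".join(chunk for chunk in chunks if self.vad(chunk))
        if not speech:
            return b""  # nothing was said, so nothing is synthesized
        transcript = self.stt(speech)
        reply = self.llm(transcript)
        return self.tts(reply)


# Toy stages (assumptions for illustration only, not real models).
pipeline = CascadedPipeline(
    vad=lambda chunk: any(chunk),             # any non-zero byte counts as speech
    stt=lambda audio: audio.decode("ascii"),  # pretend the audio is ASCII text
    llm=lambda text: text.upper(),            # pretend the model shouts back
    tts=lambda text: text.encode("ascii"),    # pretend bytes are audio
)

# Swapping one stage leaves the other three untouched.
reversed_llm = replace(pipeline, llm=lambda text: text[::-1])
```

because each stage is just a callable with a fixed input and output type, replacing one model with another does not touch the rest of the cascade, which is the property the "every part is swappable" claim relies on.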
the llm is the most impactful layer for latency and quality. the setup supports running models locally via llama.cpp, mlx, or transformers, or connecting to external servers through a responses api. for local use, gemma 4 and qwen3-4b are recommended for their speed and capability. the speech-to-speech engine exposes a websocket at /v1/realtime, compatible with the robot's existing protocol. users can also use hosted inference endpoints or providers like openai by changing the base url and api key.
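switching between a local server and a hosted provider by changing the base url and api key might look like the sketch below. the `LLMBackend` class, the port, the placeholder key, and the model names are assumptions for illustration; the blog's own commands should be treated as the reference for real flags and values.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class LLMBackend:
    """Hypothetical settings record; not part of any real library."""
    base_url: str
    api_key: Optional[str]  # None for a local server that needs no key
    model: str


def auth_headers(backend: LLMBackend) -> Dict[str, str]:
    """Build request headers, adding a bearer token only when a key is set."""
    headers = {"Content-Type": "application/json"}
    if backend.api_key:
        headers["Authorization"] = f"Bearer {backend.api_key}"
    return headers


# Assumed values: a llama.cpp server on localhost port 8080 and a placeholder key.
LOCAL = LLMBackend("http://localhost:8080/v1", None, "qwen3-4b")
HOSTED = LLMBackend("https://api.openai.com/v1", "sk-placeholder", "placeholder-model")
```

only the settings record changes between the two cases; the code that sends requests stays the same, which is why pointing the pipeline at a different provider is a configuration change and not a code change.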
running the pipeline locally offers privacy, no per-use costs, and full control over model choices. the blog provides detailed commands for various configurations, including vllm with speculative decoding for lower latency. the robot can connect to a laptop running the engine over a local network. the entire stack is open-source, with repositories available for the speech-to-speech library and the robot's conversation app.
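when the engine runs on a laptop and the robot connects over the local network, the robot needs the laptop's address plus the /v1/realtime path. the helper below is a small sketch of building that url; the function name, ip address, and port are hypothetical, and only the path comes from the blog.

```python
from urllib.parse import urlunsplit

REALTIME_PATH = "/v1/realtime"  # websocket path named in the blog


def realtime_url(host: str, port: int, secure: bool = False) -> str:
    """Return the websocket URL for an engine reachable at host:port."""
    scheme = "wss" if secure else "ws"
    return urlunsplit((scheme, f"{host}:{port}", REALTIME_PATH, "", ""))


# Assumed LAN address and port of the laptop running the engine.
url = realtime_url("192.168.1.42", 8765)
```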
why it matters: this enables private, customizable voice interactions for robots with no per-use cost, letting developers swap models as new ones emerge without relying on cloud services.