spin up a vllm server on hugging face jobs in one command

source: hugging face blog: run a vllm server on hf jobs in one command

level: technical

you can start a vllm server on hugging face jobs using the hf jobs run command. after installing huggingface_hub and logging in, run a command like hf jobs run --flavor a10g-large --expose 8000 --timeout 2h vllm/vllm-openai:latest vllm serve qwen/qwen3-4b --host 0.0.0.0 --port 8000. this gives you a url where the server is reachable, protected by your hugging face token. the server takes a few minutes to download model weights and boot.

once running, you can query the endpoint using curl or the openai python client. for curl, send a request to https://<job_id>--8000.hf.jobs/v1/chat/completions with your token as a bearer token. in python, set the openai client's base_url to the job url and use your token as the api key. the endpoint is private and requires a token with read access to the job. stop the server with hf jobs cancel <job_id> to avoid extra costs, as billing is per second.

the same approach scales to larger models by choosing a bigger flavor and adding --tensor-parallel-size. for example, a 122b model on two h200 gpus uses --flavor h200x2 and --tensor-parallel-size 2. you can also add a gradio chat interface, ssh into the container for debugging, or use the endpoint as a backend for a coding agent like pi. for production use, consider inference endpoints instead, which offer access control and scale-to-zero.

why it matters: this lets ai practitioners quickly deploy and test large language models without managing infrastructure, speeding up experimentation and evaluation.

source: hugging face blog: run a vllm server on hf jobs in one command