How to Run Claude Code on a Local vLLM Model Using LiteLLM Proxy

TL;DR: Claude Code only speaks the Anthropic Messages API. vLLM only speaks the OpenAI Chat Completions API. To run Claude Code against your local vLLM model, you put a LiteLLM proxy in the middle, point Claude Code at the proxy, and you are done. Total setup is about 20 minutes if you already have vLLM running.

I have been running this setup on a workstation for the last week. The reason is simple: my Anthropic bill for Claude Code crossed a number this month that I was no longer comfortable paying for marginal gains over a local Qwen2.5-Coder model.

The good news is that the integration is solid once you know the shape of the problem. The bad news is that the docs from Anthropic, vLLM, and LiteLLM each cover one piece of the puzzle and none of them stitch the whole thing together.

This is the stitched version. I will walk through what each component is doing, show the exact config, and flag the three things that broke for me before I got it working.

Why the Two APIs Do Not Talk Directly

Claude Code expects POST requests to a /v1/messages endpoint that follows Anthropic’s Messages API schema. vLLM serves an OpenAI-compatible /v1/chat/completions endpoint with a different request and response shape.

A direct connection fails because Claude Code does not understand what comes back from vLLM, and vice versa.

The differences are not cosmetic. The Anthropic Messages format groups system prompts separately, structures tool calls differently, and uses a different streaming event format. Translating between them is a real piece of work.
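To make the mismatch concrete, here is a toy sketch of the same request in both schemas. This is illustrative only, not LiteLLM's actual code; the real translation also handles tool calls, streaming events, and a long tail of edge cases.

```python
# Toy illustration of the schema gap. Anthropic's Messages API keeps the
# system prompt in a top-level field; OpenAI's Chat Completions API folds
# it into the messages list.

anthropic_req = {
    "model": "claude-3-5-sonnet-20241022",
    "max_tokens": 1024,
    "system": "You are a coding assistant.",
    "messages": [{"role": "user", "content": "Write a hello world."}],
}

def to_openai(req: dict) -> dict:
    """Sketch of the Anthropic -> OpenAI request translation."""
    messages = []
    if "system" in req:
        messages.append({"role": "system", "content": req["system"]})
    messages.extend(req["messages"])
    return {
        "model": req["model"],
        "max_tokens": req["max_tokens"],
        "messages": messages,
    }

openai_req = to_openai(anthropic_req)
```

The response translation runs the same way in reverse, plus the streaming event mapping, which is where most of the real complexity lives.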

This is exactly the gap LiteLLM was built for. LiteLLM accepts Anthropic-formatted requests on the front, translates them to OpenAI format on the back, sends them to vLLM, and translates the response back. From Claude Code’s perspective it looks like it is talking to api.anthropic.com.

The other path is to use a Claude Code-specific proxy like fuergaosi233/claude-code-proxy. From what I have seen, LiteLLM is the cleaner bet because it also gives you OpenAI-compatible endpoints for other tools you might want to point at the same model later.

The Components You Need Running

You need three things up and running before any wiring happens: vLLM serving your model, LiteLLM running as the proxy, and Claude Code installed locally with the right environment variables set.

Each runs as a separate process. None of them know about each other until you tell them.

| Component | What it does | Default port |
| --- | --- | --- |
| vLLM | Serves your local LLM with an OpenAI-compatible API | 8000 |
| LiteLLM Proxy | Translates Anthropic-format requests into OpenAI format | 4000 |
| Claude Code | The CLI you run; it talks to LiteLLM | (client only) |

I run all three on the same workstation. You can split them across machines if you want; just point the URLs at the right hosts.

The model I used for testing was Qwen/Qwen2.5-Coder-32B-Instruct because it has decent tool-calling support. Models without strong tool calling will fail silently in Claude Code’s agent loop. That is the first gotcha. If you pick a model that cannot do structured tool use, Claude Code will sit there spinning while the model returns plain prose instead of the tool call format the harness expects.
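The distinction is easy to check by hand. Given an OpenAI-format chat completion choice (the shape vLLM serves), a structured tool call shows up in the message's tool_calls field; a model that cannot do tool use puts a prose description in content instead. The helper name here is mine, a minimal sketch:

```python
# Minimal check: did the model return a structured tool call, or just prose?
# The dicts below mimic the OpenAI Chat Completions "choice" shape.

def has_tool_call(choice: dict) -> bool:
    """True if the assistant message carries a parsed tool_calls entry."""
    msg = choice.get("message", {})
    return bool(msg.get("tool_calls"))

structured = {"message": {"role": "assistant", "content": None,
                          "tool_calls": [{"type": "function",
                                          "function": {"name": "read_file",
                                                       "arguments": '{"path": "main.py"}'}}]}}

prose_only = {"message": {"role": "assistant",
                          "content": 'I would call read_file with {"path": "main.py"}.'}}
```

Claude Code's harness only acts on the first shape; the second is what "spinning forever" looks like on the wire.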

Step by Step Setup

Here is the sequence I would walk through on a fresh machine. Each step assumes the previous one finished cleanly. Total wall time is about 20 minutes if your model is already downloaded.

  1. Start vLLM serving your chosen coding model on port 8000
  2. Install LiteLLM proxy with pip install 'litellm[proxy]'
  3. Write a litellm_config.yaml that maps an Anthropic model name to your local vLLM endpoint
  4. Start LiteLLM proxy on port 4000 pointing at that config
  5. Set Claude Code environment variables to redirect API calls to localhost:4000
  6. Run claude and verify it is hitting your proxy by watching LiteLLM logs

The pieces that trip people up are step 3 and step 5. Let me show the actual files.

vLLM launch command (Step 1):

vllm serve Qwen/Qwen2.5-Coder-32B-Instruct \
  --port 8000 \
  --enable-auto-tool-choice \
  --tool-call-parser hermes

The --enable-auto-tool-choice and --tool-call-parser flags are non-negotiable. Without them, vLLM will not produce tool calls in the format LiteLLM expects to forward.
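For intuition on what the parser flag is doing: hermes-style models emit their function calls as JSON wrapped in `<tool_call>` tags, and the parser's job is to lift that into the structured tool_calls field. A rough sketch of the idea (vLLM's real parser also handles streaming and malformed output):

```python
import json
import re

# Hermes-style output: the function call is JSON inside <tool_call> tags.
raw = '<tool_call>\n{"name": "read_file", "arguments": {"path": "main.py"}}\n</tool_call>'

def parse_hermes(text: str):
    """Extract the JSON payload from a hermes-style tool call, if present."""
    m = re.search(r"<tool_call>\s*(\{.*\})\s*</tool_call>", text, re.S)
    return json.loads(m.group(1)) if m else None

call = parse_hermes(raw)
```

A parser mismatched to the model family sees the tag convention it does not expect, extracts nothing, and the call falls through as plain text.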

LiteLLM config (Step 3):

model_list:
  - model_name: claude-3-5-sonnet-20241022
    litellm_params:
      model: openai/Qwen/Qwen2.5-Coder-32B-Instruct
      api_base: http://localhost:8000/v1
      api_key: dummy

litellm_settings:
  drop_params: true

The trick here is that you label the model as claude-3-5-sonnet-20241022 even though the underlying model is Qwen.

Claude Code sends requests for that model name. LiteLLM intercepts and routes to vLLM. Claude Code never knows the difference.

The drop_params: true setting silently discards Anthropic-specific parameters that OpenAI does not understand, instead of erroring out. Without it, you will see 400 responses on roughly half your requests.
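Conceptually, the setting is a filter against an allow-list of parameters the target schema defines. This is a sketch of the idea only; LiteLLM's real allow-list is per-provider and far more complete than the one I hard-code here:

```python
# Rough sketch of what drop_params does. The allow-list below is
# illustrative, not LiteLLM's actual list.

OPENAI_SUPPORTED = {"model", "messages", "max_tokens", "temperature",
                    "top_p", "stream", "stop", "tools", "tool_choice"}

def drop_unsupported(req: dict) -> dict:
    """Silently discard request params the OpenAI schema does not define."""
    return {k: v for k, v in req.items() if k in OPENAI_SUPPORTED}

# top_k and metadata are Anthropic-side extras that vLLM would 400 on.
req = {"model": "x", "messages": [], "max_tokens": 64,
       "top_k": 40, "metadata": {"user_id": "abc"}}
clean = drop_unsupported(req)
```

The flip side of "silently" is worth remembering when debugging: a parameter you thought you set may simply never reach vLLM.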

Claude Code env vars (Step 5):

export ANTHROPIC_BASE_URL="http://localhost:4000"
export ANTHROPIC_AUTH_TOKEN="sk-anything"
export ANTHROPIC_MODEL="claude-3-5-sonnet-20241022"

ANTHROPIC_BASE_URL is the override that redirects Claude Code from api.anthropic.com to your local proxy. ANTHROPIC_AUTH_TOKEN can be any non-empty string because LiteLLM does not enforce auth in this config.

Run claude and you should see the LiteLLM logs light up with incoming requests.
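If you want to sanity-check the proxy before involving Claude Code at all, you can hand-build the same kind of Anthropic Messages request and fire it at port 4000 yourself. A minimal sketch using only the standard library (the endpoint and header names match what the proxy expects in this config):

```python
import json
import urllib.request

BASE = "http://localhost:4000"  # the LiteLLM proxy from Step 4

def build_probe(model: str = "claude-3-5-sonnet-20241022") -> urllib.request.Request:
    """Build a minimal Anthropic Messages request aimed at the proxy."""
    body = {
        "model": model,
        "max_tokens": 32,
        "messages": [{"role": "user", "content": "Reply with the word OK."}],
    }
    return urllib.request.Request(
        f"{BASE}/v1/messages",
        data=json.dumps(body).encode(),
        headers={"content-type": "application/json", "x-api-key": "sk-anything"},
    )

# With the proxy running, send it and print the response:
# with urllib.request.urlopen(build_probe(), timeout=60) as resp:
#     print(json.load(resp))
```

A 200 with a text block in the response means the whole chain (proxy, translation, vLLM, model) is working before Claude Code ever enters the picture.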

The Three Things That Broke for Me

The three failure modes I hit, in the order I hit them, were tool-call formatting, streaming response truncation, and Claude Code’s hidden retry behavior masking real errors. All three are fixable. None of them are documented in one place.

The first issue was tool-call formatting. Qwen2.5-Coder claims to support tool calling, but vLLM’s default parser misread the model’s output. The fix was the --tool-call-parser hermes flag in the vLLM launch. Different models need different parsers: Llama models want llama3_json, Mistral wants mistral. If your model is producing tool calls but Claude Code is not seeing them, this is almost always the cause.

Symptom and fix table:

| Symptom | Likely cause | Fix |
| --- | --- | --- |
| Claude Code spins forever waiting for a tool call | Model’s tool calls are not being parsed by vLLM | Set `--tool-call-parser` to match your model family |
| 400 errors on roughly half your requests | LiteLLM forwarding Anthropic-only params to vLLM | Add `drop_params: true` to `litellm_settings` |
| Streaming responses cut off mid-sentence | LiteLLM proxy timing out the SSE stream early | Add `request_timeout: 600` to `litellm_settings` |
| `claude` command fails with auth error | ANTHROPIC_BASE_URL not exported in current shell | Re-export and confirm with `env \| grep ANTHROPIC` |
| Model returns prose instead of structured tool calls | Model lacks proper tool-calling support | Switch to a model with native tool support like Qwen Coder, DeepSeek Coder, or GLM Code |

The second issue was streaming. Claude Code expects SSE responses for long completions. LiteLLM forwards the stream by default, but the proxy’s request timeout can cut off long generations. Set request_timeout: 600 in litellm_settings if you see truncated output.
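For reference, the combined litellm_settings block I ended up with, the drop_params from the original config plus the stream timeout:

```yaml
litellm_settings:
  drop_params: true
  request_timeout: 600   # seconds; raise it if long generations still get cut off
```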

The third issue was the most painful. Claude Code retries failed requests silently up to three times, which means a malformed config can look like it is working slowly when it is failing on every attempt and retrying behind the scenes.

The way I caught it was by tailing the LiteLLM logs in a second terminal and watching the request count climb faster than the Claude Code UI implied.
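The check itself is nothing more than counting proxy-side requests and comparing against what the UI implies. The log line format below is made up; adapt the pattern to whatever your LiteLLM log lines actually look like:

```python
# Sketch of the retry check: count proxy-side requests per turn. The log
# format here is invented for illustration; match against your real logs.

def count_requests(log_text: str, pattern: str = "POST /v1/messages") -> int:
    return sum(1 for line in log_text.splitlines() if pattern in line)

sample = """\
10:00:01 POST /v1/messages 200
10:00:05 POST /v1/messages 500
10:00:06 POST /v1/messages 500
10:00:08 POST /v1/messages 500
"""

n = count_requests(sample)
# Four proxy requests behind what the UI showed as one slow turn is the
# signature of silent retries against a failing config.
```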

Why This Is Worth Doing for Some Workflows

It is worth doing if your monthly Claude API spend is meaningful and the model gap is narrow for your specific code. It is not worth doing if you are doing exploratory work that benefits from frontier reasoning.

For routine refactoring, test generation, and boilerplate, a 32B coding model on a workstation is close enough to Claude Sonnet that the difference does not show up in finished commits. For architectural exploration, novel algorithms, or anything where the model needs to make non-obvious connections, the frontier model still wins by a margin you will feel.

The setup also pairs well with the Claude Code subagent pattern, where you can use cheap local inference for the parallel research agents and reserve frontier API calls for the synthesis layer. Routing through LiteLLM makes that split a config-file change instead of a code change.
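Here is a sketch of what that split looks like in the LiteLLM config: two model entries, one routed to the local vLLM instance and one to Anthropic. The model names on the left are mine, and the second entry assumes you keep a real ANTHROPIC_API_KEY in the environment (LiteLLM's os.environ/ syntax reads it at request time):

```yaml
model_list:
  - model_name: local-coder              # cheap parallel research agents
    litellm_params:
      model: openai/Qwen/Qwen2.5-Coder-32B-Instruct
      api_base: http://localhost:8000/v1
      api_key: dummy
  - model_name: frontier-synthesis       # reserved for the synthesis layer
    litellm_params:
      model: anthropic/claude-3-5-sonnet-20241022
      api_key: os.environ/ANTHROPIC_API_KEY
```

Which agent hits which entry is then just a question of which model name it requests.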

If you are coming from a different agent setup and exploring options, it is worth seeing how the broader coding agent space is restructuring before committing to a local-first workflow. The short version is that local inference plus a thin proxy layer is becoming a more credible substitute for hosted agents than it was six months ago, and tools like Make.com make it easy to bolt webhook triggers onto whatever agent loop you settle on.

The other angle worth flagging: a workstation running a 32B model continuously costs real electricity. For most solo developers, the math still favors local once your monthly Anthropic spend crosses about $80. Below that, you are paying for the optionality of frontier capability and that is fine.

Frequently Asked Questions

Can I use this same setup with Codex or Cursor instead of Claude Code?

You can. LiteLLM serves both Anthropic and OpenAI compatible endpoints from the same proxy. Point Cursor or Codex at the OpenAI endpoint on port 4000 with the same config, and they will hit the same vLLM model. You can run all three tools against one local model.

What model should I start with for coding work?

Start with Qwen2.5-Coder-32B-Instruct if you have 24GB of VRAM or more. For 16GB, drop to DeepSeek-Coder-V2-Lite-Instruct. For 8GB, you are below the floor where local coding models are good enough to replace frontier APIs and should stay on the hosted route.

Does this setup support multimodal inputs like screenshots?

Not with the configuration shown. Claude Code’s multimodal pathway goes through Anthropic’s vision API, which has no direct vLLM equivalent for arbitrary models. You can add vision-capable local models, but the wiring is more involved and outside the scope of this setup.

How much does this save versus the Anthropic API?

For a workflow that was costing $200 a month on Claude Sonnet, my electric bill went up about $15 and the API spend dropped to $20 (kept for the frontier reasoning escape hatch). Net savings about $165 a month. The math gets better at higher API spend levels and worse at lower.

Can I run this without LiteLLM, using only vLLM?

You cannot, because Claude Code does not speak OpenAI. You need either LiteLLM, fuergaosi233/claude-code-proxy, or one of the other Anthropic-to-OpenAI translators in the middle. Pick one and stick with it; running two proxies in series is a debugging nightmare.

Does Claude Code see this as the real Claude or as a local model?

Claude Code sees whatever model name the proxy returns in responses. Because we map the local model to claude-3-5-sonnet-20241022 in the LiteLLM config, Claude Code’s UI will display “Sonnet” even though the actual inference is happening on Qwen. This is intentional and works fine, just be aware when you are debugging that the model name is a label, not a fact.
