How to Split Your AI Agent Across Two Model Tiers

TL;DR: Most AI agents spend 75 percent of their LLM calls on tiny tool-pick decisions that do not need a premium model. Route those to a cheap LLM (GPT-OSS 120B, Haiku 4.5, DeepSeek V3) and reserve the premium model for synthesis steps. Real teams are seeing 70 to 80 percent cost reductions without a measurable quality drop.

A B2B sales builder whose write-up I read this morning ran the same eight-tool enrichment agent two ways. Week one ran on GPT-5.4 only: 1,200 companies processed, and the bill came in at $290. Week two used a router-synthesis split: 1,400 companies processed, and the bill came in at $65.

That is a 78 percent reduction at slightly higher throughput. Same outputs, same downstream sales team, no measurable quality difference.

The pattern is not new but it is underexplained. Most cost-optimization writeups stop at “use a cheaper model for simple tasks” and leave the actual wiring as an exercise.

This piece walks through what the split looks like in code, when it beats the deterministic routing alternative, and the five edge cases that will bite you when you try this on your own agent.

If you already use deterministic code to skip the LLM entirely on routing, the deterministic-pre-LLM pattern is the right starting point and you can ignore this article. The split below is for builders whose routing is too dynamic to encode in if statements.

Why Most Agent Bills Are Driven by Routing Calls

An agent’s API bill is dominated by tiny tool-pick decisions, not by the heavy lifting.

Look at the call breakdown of any tool-using agent and 70 to 80 percent of the calls will be one-line “given the current state and available tools, pick the next action” decisions that produce a JSON blob the size of a tweet.

[Figure: AI agent routing calls cost dominance breakdown]

The remaining 20 to 30 percent are synthesis calls. Summarize this scraped page, draft this outreach paragraph, reason about whether the evidence matches our ICP. Those benefit from a real model.

The economics work because output tokens cost 3x to 10x more than input tokens, and routing calls have tiny outputs. A routing call might return 40 output tokens; a synthesis call might return 800. When 75 percent of your calls are the cheap-output kind, paying premium-model rates on them is just leakage.

The price gap between tiers is wider than most builders realize. GPT-4o output runs $10 per million tokens; GPT-4o-mini output runs $0.60. That is roughly a 17x ratio.

Claude Sonnet 4.6 input runs $3.00 per million tokens against Haiku 4.5 at $1.00. That is a 3x ratio, and the roughly 6-point SWE-bench gap between them (73.3 for Haiku vs 79.6 for Sonnet) does not matter for picking a tool from a list of eight.
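To make the leakage concrete, here is a rough sketch of the per-call arithmetic. The output prices are the GPT-4o and GPT-4o-mini figures quoted above. Everything else is an assumption for illustration: the input prices ($2.50 and $0.15 per million), the token counts per call, the 75 percent routing share, and the helper names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """USD per million tokens."""
    input_per_m: float
    output_per_m: float


def call_cost(pricing: Pricing, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of a single LLM call."""
    return (input_tokens * pricing.input_per_m
            + output_tokens * pricing.output_per_m) / 1_000_000


def batch_cost(router: Pricing, synthesis: Pricing, calls: int,
               router_share: float = 0.75) -> float:
    """Cost of a batch in which router_share of the calls are routing calls.

    Assumed token profile: routing calls carry 4,000 input tokens (tool
    schemas plus state) and return 40; synthesis calls carry 2,000 input
    tokens and return 800.
    """
    routing_calls = round(calls * router_share)
    synthesis_calls = calls - routing_calls
    return (routing_calls * call_cost(router, 4_000, 40)
            + synthesis_calls * call_cost(synthesis, 2_000, 800))


PREMIUM = Pricing(input_per_m=2.50, output_per_m=10.00)  # input price assumed
CHEAP = Pricing(input_per_m=0.15, output_per_m=0.60)     # input price assumed

premium_only = batch_cost(PREMIUM, PREMIUM, calls=10_000)
split = batch_cost(CHEAP, PREMIUM, calls=10_000)
print(f"premium only: ${premium_only:.2f}, split: ${split:.2f}")
```

The result is very sensitive to the token profile. Shrink the routing prompts and the synthesis calls start to dominate the bill, which caps what the split can save. Plug in your own logged token counts before trusting any headline percentage.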

The way I see it, the question is never “is the cheap model good enough for everything” because it rarely is. The question is “what percentage of my calls is the cheap model good enough for”, and the answer is usually most of them.

What the Cheap-Router Premium-Synthesis Split Looks Like

The architecture is two model tiers behind one orchestration layer, with the call site deciding which tier to hit. No fancy framework needed. The OpenAI SDK and Anthropic SDK both let you set a base_url per client and a model string per call, so the split lives in a few lines of conditional logic.

Here is the call shape:

import os

from openai import OpenAI

# messages and tools are assumed to be built by the agent loop

# Router calls, cheap and fast
router_client = OpenAI(
    base_url="https://api.gmi-cloud.com/v1",
    api_key=os.environ["GMI_KEY"]
)
router_response = router_client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=messages,
    tools=tools,
    max_tokens=256,        # routing outputs are tiny
    temperature=0.1,       # deterministic enough
)

# Synthesis calls, premium
synthesis_client = OpenAI(
    base_url="https://api.openai.com/v1",
    api_key=os.environ["OPENAI_KEY"]
)
synthesis_response = synthesis_client.chat.completions.create(
    model="gpt-5.4",
    messages=messages,
    max_tokens=2000,
    temperature=0.3,
)

The agent’s outer loop decides which client to call based on the step type. If the step is “pick next tool from the registry” or “decide whether to retry”, route. If the step is “summarize”, “extract”, “reason”, “draft”, synthesize.
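One way to express that decision is a small lookup table, sketched below. The step labels and the names tier_for and settings_for are hypothetical; the model strings and limits mirror the call shape above. Sending unknown steps to the premium tier is my own conservative default, not something the split requires.

```python
from enum import Enum


class Tier(Enum):
    ROUTER = "router"
    SYNTHESIS = "synthesis"


# Hypothetical step labels; substitute whatever your agent loop already uses.
STEP_TIERS = {
    "pick_tool": Tier.ROUTER,
    "decide_retry": Tier.ROUTER,
    "summarize": Tier.SYNTHESIS,
    "extract": Tier.SYNTHESIS,
    "reason": Tier.SYNTHESIS,
    "draft": Tier.SYNTHESIS,
}

# Per-tier request settings, mirroring the call shape above.
TIER_SETTINGS = {
    Tier.ROUTER: {
        "model": "openai/gpt-oss-120b", "max_tokens": 256, "temperature": 0.1,
    },
    Tier.SYNTHESIS: {
        "model": "gpt-5.4", "max_tokens": 2000, "temperature": 0.3,
    },
}


def tier_for(step_type: str) -> Tier:
    """Map a step label to a tier; unknown steps go to the premium tier."""
    return STEP_TIERS.get(step_type, Tier.SYNTHESIS)


def settings_for(step_type: str) -> dict:
    """Return a copy of the request settings for the step's tier."""
    return dict(TIER_SETTINGS[tier_for(step_type)])
```

Defaulting to premium means a mislabeled step costs you money rather than quality, which is the cheaper mistake to discover later.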

Tier	Use this for	Model picks	Typical price
Router	Tool pick, decide-to-retry, dedupe	GPT-OSS 120B, Haiku 4.5, DeepSeek V3, Gemini Flash	$0.15 to $1.00 / M input
Synthesis	Summarize, extract, draft, reason	GPT-5.4, Claude Sonnet 4.6, Gemini 2.5 Pro	$2.50 to $5.00 / M input
Planner (optional)	Decompose the goal, plan tool sequence	GPT-5.4, Claude Opus 4.6	$5.00 to $15.00 / M input

I would not downgrade the planner tier even if the cheap model looks viable. The Augment Code routing guide reports that a three-tier setup (Opus plan, Sonnet implement, Haiku review) cuts costs 51 percent overall while keeping the planner on the top tier. Downgrading the planner itself produces cascading errors because downstream agents have no mechanism to correct upstream planning mistakes, so keep the planner expensive.

The single biggest configuration change that nobody mentions is max_tokens. Left unset or at a generous value such as 4096, cheap models will happily fill the budget with reasoning chatter when all you wanted was a tool name. Drop routing calls to 256 and the latency improves alongside the cost.

How to Wire Two Model Tiers Into One Agent

The wiring takes about an hour if your agent already has a clean call site abstraction, longer if it does not. Here is the order I would do it in.

1. Tag every call site in your agent code as router or synthesis. Walk through the agent’s main loop and label every chat.completions.create invocation. Most agents have between 3 and 8 distinct call sites. Tool-pick decisions are routers. Summarize-this, draft-this, extract-this, reason-about-this are synthesis.
2. Pick your router provider and model. GPT-OSS 120B on GMI Cloud, Together, or any OpenAI-compatible endpoint is the current sweet spot for builders who want frontier-OSS quality at $0.10 to $0.20 per million tokens. Haiku 4.5 is the right call if you are already in the Anthropic SDK and do not want to manage two providers. DeepSeek V3 is the cheapest viable option but watch the response-format strictness.
3. Build a thin client wrapper. Wrap each provider’s SDK in a route() and synthesize() function that takes messages and tools and returns the parsed response. The wrapper is where you swap base_url and model. Twenty lines.
4. Run a side-by-side validation. Take 50 representative inputs through your agent on the premium-only stack and the split stack. Compare final outputs item by item. If the split produces materially worse output on more than 10 percent of cases, the cheap model is too cheap for your routing decisions and you need to step up a tier.
5. Flip the default. Once validated, route everything tagged router through the cheap client and everything tagged synthesis through the premium client. Keep the premium fallback in place for the first week so you can revert any specific call site if a regression surfaces.
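The thin wrapper from the third step might look something like the sketch below. The class name TieredClient is hypothetical, and the two client fields are assumed to be OpenAI-compatible clients such as the router_client and synthesis_client built earlier. They are typed loosely so the wrapper can also be exercised with a stub in tests.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class TieredClient:
    """Wraps two OpenAI-compatible clients behind route() and synthesize()."""
    router_client: Any      # cheap provider client
    synthesis_client: Any   # premium provider client
    router_model: str = "openai/gpt-oss-120b"
    synthesis_model: str = "gpt-5.4"

    def route(self, messages: list[dict], tools: list[dict]) -> Any:
        """Cheap tier: return the first tool call, or None if there is none."""
        response = self.router_client.chat.completions.create(
            model=self.router_model,
            messages=messages,
            tools=tools,
            max_tokens=256,    # routing outputs are tiny
            temperature=0.1,
        )
        tool_calls = response.choices[0].message.tool_calls
        return tool_calls[0] if tool_calls else None

    def synthesize(self, messages: list[dict]) -> str:
        """Premium tier: return the generated text."""
        response = self.synthesis_client.chat.completions.create(
            model=self.synthesis_model,
            messages=messages,
            max_tokens=2000,
            temperature=0.3,
        )
        return response.choices[0].message.content or ""
```

Because the model strings are plain fields, reverting a call site during the first week is a one-line change: construct the wrapper with the premium client and model in the router slots.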

Before: every call uses client.chat.completions.create(model="gpt-5.4", ...) with max_tokens=4096. Bill scales linearly with call count.
After: routing calls go to router_client.chat.completions.create(model="openai/gpt-oss-120b", max_tokens=256, ...) and synthesis calls stay on premium. Bill drops 70 to 80 percent at the same call count.

Make.com or n8n-style automation orchestration sits well above this layer, deciding when to fire the agent at all. The split lives inside the agent, not above it.

If you are stitching workflows with Make.com, the same principle applies to any AI module call inside the scenario: route the cheap ones to cheaper models there too.

When This Split Beats the Deterministic Alternative

Use this pattern when your routing decisions are too dynamic to express in code.

Deterministic routing wins when you can predict the next step from the input shape alone: regex on the message, file extension, a small if ladder. Anything that needs to look at semantic context to decide what to do next is a job for the cheap LLM.

The 50 to 200 millisecond latency overhead of an LLM router only matters when routing sits on a user-facing critical path. For background enrichment, batch pipelines, daily research agents, and anything else that runs asynchronously, the latency tax is irrelevant. The cost savings on 5,000 routing calls per week easily outweigh it.

The pattern fails when your call volume is low. For agents running under 500 calls per day, the engineering overhead of maintaining two providers, two API keys, and a router taxonomy is not worth the $30 to $50 monthly savings. Stay on a single premium model until volume justifies the split.

The pattern also fails when the cheap model needs correction more than 20 percent of the time. If the router picks the wrong tool one in five times and you have to re-prompt with a better model to fix it, the 5x pricing advantage evaporates by the second re-prompt. Validate quality on a 50-input golden set before committing.
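The golden-set check does not need tooling. Below is a minimal sketch that applies the 10 percent rule from the side-by-side validation step. The function names are hypothetical, and is_worse stands in for whatever comparison you trust: a human label, an exact match, or a grader model.

```python
from typing import Callable, Iterable


def regression_rate(
    inputs: Iterable[str],
    premium_agent: Callable[[str], str],
    split_agent: Callable[[str], str],
    is_worse: Callable[[str, str], bool],
) -> float:
    """Fraction of inputs where the split stack's output is materially worse.

    is_worse(premium_output, split_output) returns True on a regression.
    """
    cases = list(inputs)
    if not cases:
        raise ValueError("golden set is empty")
    worse = sum(
        1 for case in cases
        if is_worse(premium_agent(case), split_agent(case))
    )
    return worse / len(cases)


def split_is_acceptable(rate: float, threshold: float = 0.10) -> bool:
    """Ship the split only if regressions stay at or below the threshold."""
    return rate <= threshold
```

Keep the golden set fixed between runs so that a change in the rate reflects the model swap and not a change in the inputs.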

From my read of the production agent infrastructure patterns most teams are converging on, the router-synthesis split sits cleanly alongside idempotency keys, Postgres-backed state, and per-user rate limits. It is not a replacement for any of those; it is an addition.

The Five Edge Cases That Will Bite You

Cheap models save money, but their tool-call output is slightly less disciplined. Plan for these.

1. Trailing-comma JSON. GPT-OSS 120B and similar OSS models occasionally produce JSON with a trailing comma before the closing brace. Wrap your parser in a sanitizer that strips trailing commas and retries, otherwise you will find out when production breaks at 2 a.m.
2. Hallucinated tool names. Cheap routers will sometimes invent a tool that does not exist in your registry, especially when no available tool is a good match. Validate the picked tool name against your registry and fall back to the premium router if the validation fails. Treat this as your circuit breaker.
3. Verbose routing responses. Cheap models often want to explain their reasoning even when you asked for a tool pick. The max_tokens=256 cap stops the bleeding, but also instruct the model explicitly to return only the tool call and not any natural-language preamble.
4. Token cache misses across providers. Anthropic and OpenAI both offer up to 90 percent input-token discounts via prompt caching, but the cache lives within one provider. When you split routing to GPT-OSS and synthesis to GPT-5.4, you may lose cache hits that you had when both calls went to the same provider. Check your cache-hit rate after the switch and consider whether keeping both on the same provider (e.g., Haiku for routing and Sonnet for synthesis, both on Anthropic) is the better economic trade.
5. Output-token cost drift. Output tokens are 3x to 10x more expensive than input tokens, so a “cheap” router that produces 500-token responses can cost more than a “premium” model that produces 40-token responses. Force brevity with explicit instructions and stop sequences. A chatty cheap model is more expensive than a terse premium one.
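The trailing-comma and hallucinated-tool cases can be covered with a few lines of defensive code. This is a minimal sketch, not a hardened parser: the function names are hypothetical, and the regex repair is deliberately naive, so it only runs after strict parsing has already failed.

```python
import json
import re
from typing import Callable

# A comma followed only by whitespace before a closing brace or bracket.
_TRAILING_COMMA = re.compile(r",(\s*[}\]])")


def parse_tool_arguments(raw: str) -> dict:
    """Parse tool-call JSON, retrying once with trailing commas stripped."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # Naive repair: it would also rewrite ",}" inside a string value,
        # which is why strict parsing is always attempted first.
        return json.loads(_TRAILING_COMMA.sub(r"\1", raw))


def validated_tool_name(
    cheap_pick: Callable[[], str],
    premium_pick: Callable[[], str],
    registry: set[str],
) -> str:
    """Return a registered tool name, falling back to the premium router once."""
    name = cheap_pick()
    if name in registry:
        return name
    # Circuit breaker: the cheap router invented a tool, so ask the premium one.
    name = premium_pick()
    if name not in registry:
        raise ValueError(f"unknown tool from both tiers: {name!r}")
    return name
```

Log every fallback. The fallback rate is the same correction rate that decides whether the cheap router is earning its keep.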

For builders running longer agent sessions, the seven-layer memory stack is the other lever that compounds with this split, since cached memory hits cost almost nothing regardless of which tier handles the call.

Frequently Asked Questions

Which cheap model should I use for routing?

GPT-OSS 120B on an OpenAI-compatible endpoint is the current sweet spot for builders who want frontier-OSS quality at $0.10 to $0.20 per million tokens. Haiku 4.5 is right if you are already on Anthropic. DeepSeek V3 is the cheapest viable option.

How much will this really save me?

Real-world teams report 70 to 80 percent cost reductions on bills dominated by tool-pick calls. The Augment Code three-tier setup reports 51 percent. Your number depends on what share of your calls are routing versus synthesis.

Will quality drop?

On a 50-input side-by-side test, the split stack should match premium quality on at least 90 percent of cases. If the gap is bigger than that, the cheap model is too cheap for your routing decisions and you need to step up a tier.

Is this better than deterministic routing?

Deterministic routing wins when you can predict the next step from input shape alone. The LLM-LLM split wins when routing requires semantic context. They are complementary, not competing.

What about the latency overhead?

The 50 to 200 millisecond latency tax matters on user-facing critical paths. For background enrichment, batch pipelines, and async agents, the latency is irrelevant and the cost savings dominate.

When does this pattern stop being worth it?

Below 500 agent calls per day, the engineering overhead of two providers and a router taxonomy is not worth the savings. Stay on a single premium model until volume justifies the split.