Stop Paying Frontier Prices for Boilerplate Work

60-80% on routed traffic | Beginner | 1 min read

The single biggest model-selection waste is using a flagship model as your default for everything. Classification, formatting, short rewrites, and tag extraction do not need a frontier model, and the price gap is large.

Consider Anthropic's lineup as a concrete example. Claude Opus is roughly $5 input / $25 output per million tokens, Sonnet is around $3 / $15, and Haiku is about $1 / $5. At those rates Opus costs about 5x what Haiku does on both input and output, so the same tiny task costs about 5x as much on Opus. The same tiering exists across providers (OpenAI's GPT-5 mini/nano tiers, Gemini Flash vs Pro).
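To make the arithmetic concrete, here is a minimal sketch that turns the per-million-token prices quoted above into a per-call and per-month estimate. The workload figures (one million calls, about 300 input and 2 output tokens each) are hypothetical, and the prices are the approximate ones from this tip, so check the provider's current pricing before relying on the numbers.

```python
# Approximate USD prices per million tokens (input, output), as quoted above.
# These are assumptions for illustration and will drift over time.
PRICES_PER_MTOK = {
    "opus": (5.00, 25.00),
    "sonnet": (3.00, 15.00),
    "haiku": (1.00, 5.00),
}


def call_cost(tier: str, input_tokens: int, output_tokens: int) -> float:
    """Return the estimated USD cost of a single call on the given tier."""
    input_price, output_price = PRICES_PER_MTOK[tier]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Hypothetical workload: 1M spam checks per month,
# roughly 300 input tokens and 2 output tokens per call.
CALLS_PER_MONTH = 1_000_000

for tier in PRICES_PER_MTOK:
    monthly = CALLS_PER_MONTH * call_cost(tier, input_tokens=300, output_tokens=2)
    print(f"{tier:>6}: ${monthly:,.2f}/month")
```

Because both the input and output rates differ by the same factor between Opus and Haiku, the ratio stays at about 5x regardless of how the tokens split between prompt and reply.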

Before (wasteful):

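import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Every call hits the flagship, including a yes/no check.
# `email` holds the message text to classify.
resp = client.messages.create(
    model="claude-opus-4-8",          # $5 / $25 per MTok
    max_tokens=5,
    messages=[{"role": "user",
               "content": f"Is this email spam? Reply yes or no.\n{email}"}],
)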

After (lean):

resp = client.messages.create(
    model="claude-haiku-4-5",         # ~$1 / $5 per MTok
    max_tokens=5,
    messages=[{"role": "user",
               "content": f"Is this email spam? Reply yes or no.\n{email}"}],
)

Why it saves: a binary classifier produces a handful of output tokens and needs little reasoning. The smaller model will usually return the same answer here, and you pay roughly one-fifth the per-token rate on both input and output. Latency usually drops too, since smaller models respond faster.
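Whether the smaller model really returns the same answer is something you can measure rather than assume. The sketch below compares two lists of yes/no labels, one from each model, and reports how often they agree. The function names and the sample labels are hypothetical; in practice you would collect the labels by running both models over the same sample of real inputs.

```python
def normalize(label: str) -> str:
    """Lowercase and trim whitespace and trailing punctuation, so 'Yes.' equals 'yes'."""
    return label.strip().strip(".!").lower()


def agreement_rate(flagship_labels: list[str], small_labels: list[str]) -> float:
    """Return the fraction of positions where both models gave the same label."""
    if len(flagship_labels) != len(small_labels):
        raise ValueError("both label lists must cover the same inputs")
    if not flagship_labels:
        raise ValueError("need at least one labelled example")
    matches = sum(
        normalize(a) == normalize(b)
        for a, b in zip(flagship_labels, small_labels)
    )
    return matches / len(flagship_labels)


# Made-up labels for five emails, purely for illustration.
flagship = ["yes", "no", "No.", "yes", "no"]
small = ["Yes", "no", "no", "no", "no"]

rate = agreement_rate(flagship, small)
print(f"agreement: {rate:.0%}")
```

Agreement with the flagship is only a proxy. If you have human-labelled ground truth, scoring both models against that is a stronger check.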

The practical move: audit your call sites and bucket them as trivial (classify, extract, format), moderate (summarize, draft), and hard (multi-step reasoning, tricky code, ambiguous judgment). Route the first two buckets down a tier and reserve the flagship for the last. Validate quality on a sample before rolling out; if a downgraded route regresses, bump it back up. You are not chasing a universal cheap model; you are right-sizing per task.
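One way to encode that audit is a small lookup table that maps each call site to a difficulty bucket and each bucket to a model. The sketch below is one possible reading of the advice, not a prescribed layout: the call-site names are hypothetical, the Sonnet model ID is an assumption (only the Opus and Haiku IDs appear in the examples above), and sending trivial work to Haiku and moderate work to Sonnet is just one way to route the first two buckets down.

```python
from enum import Enum


class Difficulty(Enum):
    TRIVIAL = "trivial"    # classify, extract, format
    MODERATE = "moderate"  # summarize, draft
    HARD = "hard"          # multi-step reasoning, tricky code, ambiguous judgment


# One possible bucket-to-model mapping. The Sonnet ID is an assumption.
MODEL_BY_DIFFICULTY = {
    Difficulty.TRIVIAL: "claude-haiku-4-5",
    Difficulty.MODERATE: "claude-sonnet-4-5",
    Difficulty.HARD: "claude-opus-4-8",
}

# Hypothetical inventory produced by auditing your own call sites.
CALL_SITES = {
    "spam_check": Difficulty.TRIVIAL,
    "tag_extraction": Difficulty.TRIVIAL,
    "ticket_summary": Difficulty.MODERATE,
    "refactor_plan": Difficulty.HARD,
}

# Routes that regressed during validation get pinned back to a bigger model.
OVERRIDES: dict[str, str] = {}


def model_for(call_site: str) -> str:
    """Pick the model for a call site; unknown sites fall back to the flagship."""
    if call_site in OVERRIDES:
        return OVERRIDES[call_site]
    difficulty = CALL_SITES.get(call_site, Difficulty.HARD)
    return MODEL_BY_DIFFICULTY[difficulty]


print(model_for("spam_check"))
print(model_for("ticket_summary"))
print(model_for("something_new"))
```

Defaulting unknown call sites to the flagship is a deliberately conservative choice here: a new route costs more until someone audits it, but it never silently loses quality.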

Applies to: Claude API, ChatGPT, OpenAI API, Gemini
