60 ways to spend fewer tokens
The 22 Beginner tips are free to read. The 38 advanced tactics unlock with Pro — plus a fresh tip in your inbox every morning.
Move Every Non-Urgent Job to the Batch API and Pay Half Price
If a job doesn't need an answer in the next few seconds, send it through the Batch API instead of the live endpoint. The exact same request costs half as much.
Ask for the Patch, Not the Whole File
When editing an existing file, tell the assistant to return only the changed lines as a diff or snippet instead of regenerating the entire file.
Drop the Screenshot Once the Model Has Read It
Images, PDFs, and attachments are charged as tokens and re-sent every turn in a multimodal thread. After the model has described or transcribed one, you usually don't need to keep sending the pixels.
Count Tokens Before You Hit Send, Not After the Bill Arrives
Measure a prompt's token count before sending it, so you catch oversized context while trimming it is still free.
Stop Paying Frontier Prices for Boilerplate Work
Most of your token spend is on tasks a small model handles perfectly. Match the model to the job instead of defaulting to your most expensive option for everything.
Return IDs and Enums, Not Sentences
For classification, routing, and selection tasks, have the model emit a short code, ID, or enum value instead of a polite sentence. The downstream code only needs the token, not the prose around it.
Freeze the Prefix: One Stray Timestamp Kills Your Whole Cache
Prompt caching is a prefix match. A single dynamic byte near the top of your prompt silently invalidates everything after it, so you pay full price every call without realizing it.
Stop Paying for 'Please' and 'I Was Wondering If'
Conversational filler and apologetic framing get tokenized and billed like any other text. Strip the social padding and lead with the instruction.
Cache the Context, Not Just the Answer
Cache the retrieved chunk set keyed by a normalized query, so popular or repeated questions skip the embedding call and vector search and reuse the same context block instead of rebuilding it every time.
Classify a Whole List in One Call, Not One Row at a Time
Send 20-50 items as a numbered list and get back a JSON array of labels, instead of paying for the same instruction prompt on every single row.
Run /clear Between Tasks in Claude Code Instead of Letting Context Pile Up
Claude Code resends the whole conversation every turn. Finishing one task and starting an unrelated one in the same thread means you keep paying for stale tool output and dead files.
Paste the Function, Not the Whole File
Most coding questions need 20-40 lines, not your 800-line file. Send the relevant slice plus a one-line note about the rest, and your input shrinks dramatically without hurting the answer.
Set Hard Spend Caps in the Provider Console
Configure provider-side usage limits, budgets, and alerts so a bug, a retry storm, or a leaked key cannot quietly run your bill far past a ceiling you set in advance.
Strip the Preamble: Ask for the Answer Only
Chat models love to restate your question, add caveats, and offer follow-ups. On high-volume tasks those wrapper tokens dominate the bill. Tell the model to return only the payload.
Share One Cached System Prompt Across All Your Users
A single per-user byte (name, ID, locale) in the system prompt forks the cache into one entry per user. Strip personalization out of the prefix so every user reads the same cached block.
Fence the Output Before It Wanders
State the constraints that usually trigger a do-over up front, so you don't pay for a second generation just to strip the preamble.
Stop Pasting Whole Documents: Retrieve the 3 Chunks That Actually Answer the Question
Dumping a full PDF or knowledge base into every prompt bills you for thousands of tokens the model never needed. Retrieve only the passages relevant to the question instead.
Deduplicate and Cache Identical Requests Before They Ever Hit the API
Real-world batches are full of repeats. Hash each request, send each unique prompt once, and fan the answer back out to every duplicate.
Add a .cursorignore So Cursor Stops Indexing Your node_modules
Cursor's @codebase and automatic context can pull in build artifacts, lockfiles, and vendored dependencies. A .cursorignore file keeps that noise out of every prompt.
Start a New Chat When the Topic Changes
Chat apps re-send your whole conversation with every message. When you switch tasks, the old turns become dead weight you keep paying to re-transmit — even with caching discounts.
Set max_tokens as a Hard Cost Ceiling, Not an Afterthought
Output tokens are the expensive half of most API bills. Setting an explicit max_tokens on every API call turns an open-ended cost into a known maximum.
Emit CSV, Not Markdown Tables
When the model returns rows of data your code will parse, ask for CSV instead of a Markdown table. The pipes, padding spaces, and separator row in Markdown are tokens that carry no data.
Use Targeted Retries with Backoff Instead of Blindly Re-Sending
Distinguish retryable errors from real failures, back off on rate limits, and resend only the failed items so you stop paying for accidental duplicate generations.
Scope Copilot Chat With #file Instead of @workspace
@workspace tells Copilot to search your entire repo and stuff retrieved snippets into the prompt. For targeted work, naming specific files with #file is leaner and usually more accurate.
Prune Bulky Tool Results Once You've Used Them
Tool and function-call outputs are the heaviest, most disposable thing in an agent transcript. Once the model has extracted what it needs, replace the raw result with a one-line stub before the next turn.
Log input_tokens and output_tokens on Every Call to Find Your Real Waste
Persist the usage object from every API response with a feature tag, so you can see exactly which feature and which token type is draining your budget.
Cascade: Try the Cheap Model First, Escalate Only When It Fails
Send every request to a small model first, programmatically check the answer, and only escalate to a frontier model when the cheap one falls short.
Use Stop Sequences to Cut Generation the Instant You Have Enough
A stop sequence halts generation the moment a chosen string appears — you stop paying for output the instant your data is complete, no truncation guesswork required.
Cache Your Tool Definitions, Not Just the System Prompt
Tool schemas render before the system prompt, so a non-deterministic tool list silently blocks the cache for everything after it. Sort and freeze the tool array to make tools cacheable.
Ask for the Diff, Not the Director's Cut
When revising a long artifact, request only the changed lines as a patch instead of having the model reprint the whole thing.
Chunk on Structure, Not Character Count, So You Retrieve Fewer (and Smaller) Chunks
Naive fixed-length chunking splits ideas mid-sentence, forcing you to retrieve more chunks (and more overlap) to capture one answer. Chunk on semantic boundaries to send fewer tokens per query.
Stack Prompt Caching on Top of Your Batch Jobs for Compounding Savings
Batch requests support prompt caching. When every request shares a big instruction block or document, cache it once and the per-request cost collapses.
Use Inline Completion for Trivial Edits, Save Chat for Reasoning
Route boilerplate and one-liners through tab-style inline completion instead of opening a chat panel, which drags in your whole conversation and attached files.
Compress Long Threads with a Rolling Summary
Instead of dragging a 40-turn thread forward, periodically have the model write a compact state summary, then continue from that.
Give Each Feature a Token Budget and Enforce It with max_tokens
Set an explicit per-feature output ceiling instead of leaving max_tokens at a huge default, and reserve big budgets only for features that truly need them.
Don't Burn Reasoning Tokens on Tasks That Don't Reason
Model selection isn't just which model — it's which reasoning mode. Turn thinking down or off for straightforward work and reserve deep reasoning for genuinely hard problems.
Abort the Stream the Moment You Have Enough
When streaming, close the connection as soon as the part you care about arrives instead of letting the model run to its natural stop. You only pay for tokens actually generated before the abort.
Know Your Minimum: Short Prompts Silently Refuse to Cache
Below a model-specific token floor, a cache_control marker does nothing — no error, just a full-price bill. Know the floor before you rely on caching.
Two Sharp Examples Beat Eight Bloated Ones
Few-shot examples are usually the heaviest part of a prompt. Trim each one to the minimum that demonstrates the pattern, and use the fewest that hold accuracy.
Filter by Metadata Before You Search
Attach structured metadata to chunks and apply WHERE-style filters before the vector search runs, so you embed and rank a smaller candidate set and stuff fewer off-topic chunks into the prompt.
Tag Every Call So You Know Which Feature Burns Tokens
Attach feature, user, and environment labels to every API call so your bill breaks down by what actually drives cost instead of one undifferentiated total.
Order Your Prompt by Volatility: Tools, then System, then the Question
The model renders tools, then system, then messages. Put your most stable content first and your most volatile content last, or your breakpoints cache nothing reusable.
Reference Your Data, Don't Re-Paste It Every Turn
In chat UIs and stateless APIs, re-pasting the same document or spec into every message silently multiplies your input cost. Send it once and refer back.
Add a Reranker and a Hard Token Budget: Retrieve 20 Candidates, Send Only the Best 3
Vector similarity is approximate, so people inflate top-k to avoid missing the answer. A cheap reranking pass lets you fetch many candidates but send only the few that matter to the expensive LLM.
Run an Async Queue with a Concurrency Cap Instead of Firing All at Once
Push jobs through a bounded worker pool so you saturate your rate limit without tripping it, eliminating the retry storms and tier upgrades that quietly inflate cost.
Turn Off Auto Codebase Context When You Don't Need It
Features like 'codebase' auto-retrieval and automatic open-file context silently attach extra tokens to every message. Switch them off for narrow tasks and attach context explicitly.
Keep Working State in a File, Not in the Conversation
On long tasks, let the model write its plan, findings, and decisions to an external scratchpad and re-read only the slice it needs, instead of accumulating all of it as ever-growing conversation history.
Measure Cost-Per-Success, Not Cost-Per-Call
Compare two prompts or models on cost divided by successful outcomes, including retries and rework, so you stop chasing cheap calls that quietly fail and get redone.
Use a Big Model as the Planner, Small Models as the Workers
In agentic and multi-step pipelines, reserve the frontier model for orchestration and hard reasoning, and delegate bulk subtasks (search, read, extract) to a cheaper model.
Design a Compact Output Schema (and Skip the Pretty-Printing)
When you need structured data, the shape you ask for directly determines token count. Short keys, no markdown scaffolding, and minified output cut tokens on every response — and the input echo if you loop.
Match TTL to Traffic, and Pre-Warm Before the First User Hits
The default 5-minute cache evaporates between bursts. Choose 5-minute vs 1-hour TTL by your traffic gaps, and pre-warm at startup to kill first-request latency.
Factor Your System Prompt and Cap the Output
Move stable rules into a reusable, cacheable system prompt once, and constrain the response so the model can't ramble — output tokens usually cost more per token than input.
Embed and Summarize Once: Stop Re-Tokenizing the Same Documents on Every Query
Re-embedding unchanged documents and re-summarizing the same sources on every run quietly burns tokens. Compute these artifacts once, persist them, and reuse provider-side prompt caching for stable context.
Collapse Many Tiny Calls into One Structured Request
Ten one-item calls re-send your instructions ten times. Batch the items into a single request with a structured-output schema and pay the overhead once.
Pipe Logs Through head/grep Before Pasting Them Into an Assistant
A 4,000-line stack trace or verbose build log is mostly repetition the model doesn't need. Extract the signal first; pasting the whole thing is the single most wasteful coding-assistant habit.
Build a Sliding Window with a Cached, Stable Prefix
For API apps, cap history at the last N turns and put unchanging instructions first so prompt caching can discount the prefix.
Wire Up Spend Alerts and a Token Circuit Breaker Before You Need Them
Combine provider budget alerts with an in-app token meter that hard-stops a feature once it blows past its expected per-period budget.
Read the Usage Block to Prove Your Cache Actually Hits
A cache_control marker that silently never hits looks identical to one that works — until you read the three usage token fields and compute your real hit rate.
Try Zero-Shot Before You Pay for Examples
Examples are recurring input tokens on every call — test whether a crisp instruction does the job before you attach them by default.
Go Hybrid to Retrieve Less
Combine a keyword (BM25) score with the vector score so exact-term matches rank first, letting you lower top-k because the right chunk lands near the top instead of being padded around with semantic near-misses.
Like what you see?
Get a fresh one in your inbox — weekly free, daily on Pro.