Tips library — TokenWise

⚙️Batching & Automation ~50% on input + output tokens

Move Every Non-Urgent Job to the Batch API and Pay Half Price

If a job doesn't need an answer in the next few seconds, send it through the Batch API instead of the live endpoint. The exact same request costs half as much.

Beginner 1 min Read →

💻Coding Assistants Often cuts output tokens 40-70% on edits to large files, varies by file size

Ask for the Patch, Not the Whole File

When editing an existing file, tell the assistant to return only the changed lines as a diff or snippet instead of regenerating the entire file.

Beginner 1 min Read →

🧠Context Management Images can be 1,000-2,000+ tokens each; removing stale ones cuts that per turn

Drop the Screenshot Once the Model Has Read It

Images, PDFs, and attachments are charged as tokens and re-sent every turn in a multimodal thread. After the model has described or transcribed one, you usually don't need to keep sending the pixels.

Beginner 2 min Read →

📊Measurement & Budgeting 10-30% on prompts you would have sent bloated

Count Tokens Before You Hit Send, Not After the Bill Arrives

Measure a prompt's token count before sending it, so you catch oversized context while trimming it is still free.

Beginner 1 min Read →

🎚️Model Selection 60-80% on routed traffic

Stop Paying Frontier Prices for Boilerplate Work

Most of your token spend is on tasks a small model handles perfectly. Match the model to the job instead of defaulting to your most expensive option for everything.

Beginner 1 min Read →

📐Output Control Shrinks classification and routing outputs substantially, frequently 5-15x fewer output tokens per call

Return IDs and Enums, Not Sentences

For classification, routing, and selection tasks, have the model emit a short code, ID, or enum value instead of a polite sentence. The downstream code only needs the token, not the prose around it.

Beginner 2 min Read →

♻️Prompt Caching & Reuse up to 90% on the cached portion

Freeze the Prefix: One Stray Timestamp Kills Your Whole Cache

Prompt caching is a prefix match. A single dynamic byte near the top of your prompt silently invalidates everything after it, so you pay full price every call without realizing it.

Beginner 1 min Read →

✍️Prompt Engineering 10-25% on short prompts

Stop Paying for 'Please' and 'I Was Wondering If'

Conversational filler and apologetic framing get tokenized and billed like any other text. Strip the social padding and lead with the instruction.

Beginner 1 min Read →

🔎Retrieval & RAG Eliminates the embedding API call and vector search on cache hits. The saved cost tracks your hit rate; it cuts embedding/retrieval spend, not the tokens sent to the LLM.

Cache the Context, Not Just the Answer

Cache the retrieved chunk set keyed by a normalized query, so popular or repeated questions skip the embedding call and vector search and reuse the same context block instead of rebuilding it every time.

Beginner 2 min Read →

⚙️Batching & Automation Often 40-70% fewer input tokens on bulk classification, varies with prompt size

Classify a Whole List in One Call, Not One Row at a Time

Send 20-50 items as a numbered list and get back a JSON array of labels, instead of paying for the same instruction prompt on every single row.

Beginner 1 min Read →

💻Coding Assistants often 30-60% on long sessions

Run /clear Between Tasks in Claude Code Instead of Letting Context Pile Up

Claude Code resends the whole conversation every turn. Finishing one task and starting an unrelated one in the same thread means you keep paying for stale tool output and dead files.

Beginner 1 min Read →

🧠Context Management 50-90% on file-heavy prompts

Paste the Function, Not the Whole File

Most coding questions need 20-40 lines, not your 800-line file. Send the relevant slice plus a one-line note about the rest, and your input shrinks dramatically without hurting the answer.

Beginner 2 min Read →

📊Measurement & Budgeting Prevents runaway-loop and leaked-key blowups; bounds worst-case spend rather than reducing normal usage

Set Hard Spend Caps in the Provider Console

Configure provider-side usage limits, budgets, and alerts so a bug, a retry storm, or a leaked key cannot quietly run your bill far past a ceiling you set in advance.

Beginner 3 min Read →

📐Output Control Often 30-60% fewer output tokens on short tasks

Strip the Preamble: Ask for the Answer Only

Chat models love to restate your question, add caveats, and offer follow-ups. On high-volume tasks those wrapper tokens dominate the bill. Tell the model to return only the payload.

Beginner 1 min Read →

♻️Prompt Caching & Reuse Cache reads run roughly 0.1x of base input price; the more users hit the same prefix, the closer your shared instructions get to free

Share One Cached System Prompt Across All Your Users

A single per-user byte (name, ID, locale) in the system prompt forks the cache into one entry per user. Strip personalization out of the prefix so every user reads the same cached block.

Beginner 1 min Read →

✍️Prompt Engineering Roughly 20-40% fewer follow-up turns on formatting-sensitive tasks, in our experience

Fence the Output Before It Wanders

State the constraints that usually trigger a do-over up front, so you don't pay for a second generation just to strip the preamble.

Beginner 2 min Read →

🔎Retrieval & RAG 70-95% on document-heavy prompts

Stop Pasting Whole Documents: Retrieve the 3 Chunks That Actually Answer the Question

Dumping a full PDF or knowledge base into every prompt bills you for thousands of tokens the model never needed. Retrieve only the passages relevant to the question instead.

Beginner 2 min Read →

⚙️Batching & Automation Varies; commonly 20-60% on duplicate-heavy workloads

Deduplicate and Cache Identical Requests Before They Ever Hit the API

Real-world batches are full of repeats. Hash each request, send each unique prompt once, and fan the answer back out to every duplicate.

Beginner 1 min Read →

💻Coding Assistants 10-30% on context-heavy chats

Add a .cursorignore So Cursor Stops Indexing Your node_modules

Cursor's @codebase and automatic context can pull in build artifacts, lockfiles, and vendored dependencies. A .cursorignore file keeps that noise out of every prompt.

Beginner 1 min Read →

🧠Context Management Trims re-sent history; often 20-60% fewer input tokens per turn after a topic switch

Start a New Chat When the Topic Changes

Chat apps re-send your whole conversation with every message. When you switch tasks, the old turns become dead weight you keep paying to re-transmit — even with caching discounts.

Beginner 2 min Read →

📐Output Control Caps runaway costs; output tokens are typically 3-5x the input price

Set max_tokens as a Hard Cost Ceiling, Not an Afterthought

Output tokens are the expensive half of most API bills. Setting an explicit max_tokens on every API call turns an open-ended cost into a known maximum.

Beginner 1 min Read →

📐Output Control Trims tabular output noticeably, commonly 15-40% fewer tokens versus a Markdown table

Emit CSV, Not Markdown Tables

When the model returns rows of data your code will parse, ask for CSV instead of a Markdown table. The pipes, padding spaces, and separator row in Markdown are tokens that carry no data.

Beginner 2 min Read →

⚙️Batching & Automation 🔒 Pro

Fetch the Data the Agent Always Needs Before the Loop Starts

When an agent will always need the same data, making it fetch that data through a tool call costs tokens for the decision, the tool schema, and the round-trip. Gather it with a plain script before the agent starts and hand it the file.

Intermediate 3 min Unlock →

💻Coding Assistants 🔒 Pro

Return only changed lines when an agent re-reads a file

Coding agents re-read the same file many times per session, paying full-file token cost each time even when almost nothing changed. Serve a delta — only the lines that changed since the agent last saw the file — instead of the whole thing, and replace re-reads for structure with a signatures-only map.

Intermediate 3 min Unlock →

🧠Context Management 🔒 Pro

Run /context and Strip MCP Tool Schemas You Never Call

Every connected MCP server injects its full tool schemas into context on every single turn, whether you call those tools or not. Run /context to see the bloat, then disable idle servers and trim tools you never use.

Intermediate 4 min Unlock →

📊Measurement & Budgeting 🔒 Pro

Cap the loops by count, not tokens: spawn and search limits in Claude Code

Claude Code 2.1.212 added session caps on subagent spawns and WebSearch calls, both defaulting to a non-binding 200. Lower them to match your actual workload so a mis-planned agent hits a hard stop at 12 searches instead of quietly burning a full context window and the spend behind it.

Intermediate 3 min Unlock →

🎚️Model Selection 🔒 Pro

Buy More Reasoning on the Cheap Model Before You Upgrade the Tier

When a cheap model stumbles on a hard task, the reflex is to jump to the frontier tier. Often the cheaper move is to keep the small model and turn its reasoning effort up — its per-token rate is so low it can brute-reason through the problem and still cost far less.

Intermediate 3 min Unlock →

📐Output Control 🔒 Pro

Use Stop Sequences to Cut Generation the Instant You Have Enough

A stop sequence halts generation the moment a chosen string appears — you stop paying for output the instant your data is complete, no truncation guesswork required.

Intermediate 1 min Unlock →

♻️Prompt Caching & Reuse 🔒 Pro

Cache Your Tool Definitions, Not Just the System Prompt

Tool schemas render before the system prompt, so a non-deterministic tool list silently blocks the cache for everything after it. Sort and freeze the tool array to make tools cacheable.

Intermediate 1 min Unlock →

✍️Prompt Engineering 🔒 Pro

Ask for the Diff, Not the Director's Cut

When revising a long artifact, request only the changed lines as a patch instead of having the model reprint the whole thing.

Intermediate 2 min Unlock →

🔎Retrieval & RAG 🔒 Pro

Chunk on Structure, Not Character Count, So You Retrieve Fewer (and Smaller) Chunks

Naive fixed-length chunking splits ideas mid-sentence, forcing you to retrieve more chunks (and more overlap) to capture one answer. Chunk on semantic boundaries to send fewer tokens per query.

Intermediate 2 min Unlock →

⚙️Batching & Automation 🔒 Pro

Add a Relevance Gate: The Cheapest LLM Call Is the One You Don't Make

A cheaper model and a tighter prompt still pay for a call that never needed to happen. Put a deterministic, non-LLM precheck in front of the agent and skip the whole invocation when the input plainly doesn't need reasoning.

Intermediate 2 min Unlock →

💻Coding Assistants 🔒 Pro

Scope Copilot Chat With #file Instead of @workspace

@workspace tells Copilot to search your entire repo and stuff retrieved snippets into the prompt. For targeted work, naming specific files with #file is leaner and usually more accurate.

Intermediate 1 min Unlock →

🧠Context Management 🔒 Pro

Compact Context by Deleting Dead Weight, Not Summarizing It

Before a turn, run a mechanical pass that deletes low-signal tokens (boilerplate, repeated formatting, verbose metadata) while leaving every surviving sentence character-for-character identical. You cut the re-billed context without the paraphrase errors a summary introduces.

Intermediate 3 min Unlock →

📊Measurement & Budgeting 🔒 Pro

Log input_tokens and output_tokens on Every Call to Find Your Real Waste

Persist the usage object from every API response with a feature tag, so you can see exactly which feature and which token type is draining your budget.

Intermediate 1 min Unlock →

🎚️Model Selection 🔒 Pro

Cascade: Try the Cheap Model First, Escalate Only When It Fails

Send every request to a small model first, programmatically check the answer, and only escalate to a frontier model when the cheap one falls short.

Intermediate 2 min Unlock →

📐Output Control 🔒 Pro

Abort the Stream the Moment You Have Enough

When streaming, close the connection as soon as the part you care about arrives instead of letting the model run to its natural stop. You only pay for tokens actually generated before the abort.

Intermediate 1 min Unlock →

♻️Prompt Caching & Reuse 🔒 Pro

Carry Gemini Conversation State Server-Side Instead of Resending the Transcript

A multi-turn Gemini agent that rebuilds its contents list re-sends the whole transcript at full input price every turn. The Interactions API holds history server-side; pass previous_interaction_id, send only the new turn, and let the reused prefix hit implicit caching.

Intermediate 2 min Unlock →

✍️Prompt Engineering 🔒 Pro

Two Sharp Examples Beat Eight Bloated Ones

Few-shot examples are usually the heaviest part of a prompt. Trim each one to the minimum that demonstrates the pattern, and use the fewest that hold accuracy.

Intermediate 2 min Unlock →

🔎Retrieval & RAG 🔒 Pro

Filter by Metadata Before You Search

Attach structured metadata to chunks and apply WHERE-style filters before the vector search runs, so you embed and rank a smaller candidate set and stuff fewer off-topic chunks into the prompt.

Intermediate 2 min Unlock →

⚙️Batching & Automation 🔒 Pro

Use Targeted Retries with Backoff Instead of Blindly Re-Sending

Distinguish retryable errors from real failures, back off on rate limits, and resend only the failed items so you stop paying for accidental duplicate generations.

Intermediate 2 min Unlock →

💻Coding Assistants 🔒 Pro

Use Inline Completion for Trivial Edits, Save Chat for Reasoning

Route boilerplate and one-liners through tab-style inline completion instead of opening a chat panel, which drags in your whole conversation and attached files.

Intermediate 2 min Unlock →

🧠Context Management 🔒 Pro

Convert Fetched Pages to Markdown Before They Hit the Agent's Context

Browse and research agents that feed raw HTML into context pay for tag soup, scripts, and styling the model never uses. Strip each fetched page to clean Markdown at the tool-result boundary first.

Intermediate 2 min Unlock →

📊Measurement & Budgeting 🔒 Pro

Give Each Feature a Token Budget and Enforce It with max_tokens

Set an explicit per-feature output ceiling instead of leaving max_tokens at a huge default, and reserve big budgets only for features that truly need them.

Intermediate 1 min Unlock →

🎚️Model Selection 🔒 Pro

Set service_tier flex for Batch Prices on the Sync Endpoint

Add a single parameter to your OpenAI Responses or Chat Completions calls to pay Batch-API rates without restructuring anything into async batch jobs. You keep a normal synchronous request/response flow and give up only guaranteed speed.

Intermediate 2 min Unlock →

📐Output Control 🔒 Pro

Serialize Uniform Tables as TOON, Not JSON

When you feed the model a uniform array of records, JSON repeats every field name, brace, and quote on every row. TOON declares the field names once in a header and emits bare value rows, so you pay for values, not for structure you already stated.

Intermediate 2 min Unlock →

♻️Prompt Caching & Reuse 🔒 Pro

Know Your Minimum: Short Prompts Silently Refuse to Cache

Below a model-specific token floor, a cache_control marker does nothing — no error, just a full-price bill. Know the floor before you rely on caching.

Intermediate 1 min Unlock →

✍️Prompt Engineering 🔒 Pro

Reference Your Data, Don't Re-Paste It Every Turn

In chat UIs and stateless APIs, re-pasting the same document or spec into every message silently multiplies your input cost. Send it once and refer back.

Intermediate 2 min Unlock →

🔎Retrieval & RAG 🔒 Pro

Add a Reranker and a Hard Token Budget: Retrieve 20 Candidates, Send Only the Best 3

Vector similarity is approximate, so people inflate top-k to avoid missing the answer. A cheap reranking pass lets you fetch many candidates but send only the few that matter to the expensive LLM.

Intermediate 2 min Unlock →

⚙️Batching & Automation 🔒 Pro

Stack Prompt Caching on Top of Your Batch Jobs for Compounding Savings

Batch requests support prompt caching. When every request shares a big instruction block or document, cache it once and the per-request cost collapses.

Intermediate 1 min Unlock →

🧠Context Management 🔒 Pro

Load Skill Names Up Front, Bodies Only When Triggered

A big library of specialized agent instructions billed on every turn is mostly dead weight. Structure each capability as a Skill so only its name and one-line description load at startup, and the heavy instruction body is read into context only when that skill actually fires.

Intermediate 3 min Unlock →

📊Measurement & Budgeting 🔒 Pro

Keep agent threads under the long-context price cliff

GPT-5.6 switches to a higher long-context rate once a single request crosses ~272k tokens. Agent harnesses that quietly raised their window push routine threads past that line and double the per-token bill. Cap each thread below the cliff — compact or start fresh before you cross — and you keep paying the base rate.

Intermediate 3 min Unlock →

🎚️Model Selection 🔒 Pro

Don't Burn Reasoning Tokens on Tasks That Don't Reason

Model selection isn't just which model — it's which reasoning mode. Turn thinking down or off for straightforward work and reserve deep reasoning for genuinely hard problems.

Intermediate 2 min Unlock →

♻️Prompt Caching & Reuse 🔒 Pro

Order Your Prompt by Volatility: Tools, then System, then the Question

The model renders tools, then system, then messages. Put your most stable content first and your most volatile content last, or your breakpoints cache nothing reusable.

Intermediate 1 min Unlock →

🧠Context Management 🔒 Pro

Prune Bulky Tool Results Once You've Used Them

Tool and function-call outputs are the heaviest, most disposable thing in an agent transcript. Once the model has extracted what it needs, replace the raw result with a one-line stub before the next turn.

Intermediate 2 min Unlock →

📊Measurement & Budgeting 🔒 Pro

Tag Every Call So You Know Which Feature Burns Tokens

Attach feature, user, and environment labels to every API call so your bill breaks down by what actually drives cost instead of one undifferentiated total.

Intermediate 1 min Unlock →

🧠Context Management 🔒 Pro

Compress Long Threads with a Rolling Summary

Instead of dragging a 40-turn thread forward, periodically have the model write a compact state summary, then continue from that.

Intermediate 2 min Unlock →

⚙️Batching & Automation 🔒 Pro

Run an Async Queue with a Concurrency Cap Instead of Firing All at Once

Push jobs through a bounded worker pool so you saturate your rate limit without tripping it, eliminating the retry storms and tier upgrades that quietly inflate cost.

Advanced 2 min Unlock →

💻Coding Assistants 🔒 Pro

Call Tools as Code, Not by Injecting Every Tool Definition

Default MCP setups load every tool's JSON schema into the prompt on every turn and round-trip each intermediate result back through the model. Let the agent search for the few tools it needs and orchestrate them in code instead.

Advanced 3 min Unlock →

🧠Context Management 🔒 Pro

Abort the Agent the Moment It Stops Making Progress

Count caps, task budgets, and request de-duplication all miss the same failure: an agent that keeps issuing the same tool call, or advances a turn with no state change, and re-sends the full transcript every time. Add a duplicate/no-progress detector that halts the run the instant a loop is detected, so you stop paying quadratic re-prefill on a trajectory that cannot converge.

Advanced 4 min Unlock →

📊Measurement & Budgeting 🔒 Pro

Measure Cost-Per-Success, Not Cost-Per-Call

Compare two prompts or models on cost divided by successful outcomes, including retries and rework, so you stop chasing cheap calls that quietly fail and get redone.

Advanced 2 min Unlock →

🎚️Model Selection 🔒 Pro

The Harness Effect: Fix Your Orchestration Before You Reach for a Cheaper Model

The reflex when an agent bill climbs is to downgrade the model - a flat, one-time per-token discount that costs quality. A controlled study (The Harness Effect, arXiv 2607.06906) held the model fixed and changed only the orchestration layer, cutting tokens/task 38% and cost 41%. Audit and fix the harness - cache-shape, compaction, offload, suspend-don't-poll, failure governance - before you reach for a cheaper model.

Advanced 4 min Unlock →

📐Output Control 🔒 Pro

Cap runaway agents with Anthropic task budgets, not just max_tokens

A per-agent task_budget gives Claude one advisory token countdown across the ENTIRE loop (thinking + tool calls + tool results + output), so it paces itself and finishes gracefully instead of over-exploring or cutting off mid-action and forcing an expensive retry.

Advanced 3 min Unlock →

♻️Prompt Caching & Reuse 🔒 Pro

Compact When the Cache Is Already Cold, Not When the Transcript Gets Big

Most harnesses compact when the transcript crosses a token count, which destroys a warm prefix you already paid to write. Compact instead at the moment a message arrives after the cache TTL has expired — you have to rewrite the whole prefix at write rates anyway, so the forced write simply gets smaller.

Advanced 5 min Unlock →

✍️Prompt Engineering 🔒 Pro

Factor Your System Prompt and Cap the Output

Move stable rules into a reusable, cacheable system prompt once, and constrain the response so the model can't ramble — output tokens usually cost more per token than input.

Advanced 2 min Unlock →

🔎Retrieval & RAG 🔒 Pro

Embed and Summarize Once: Stop Re-Tokenizing the Same Documents on Every Query

Re-embedding unchanged documents and re-summarizing the same sources on every run quietly burns tokens. Compute these artifacts once, persist them, and reuse provider-side prompt caching for stable context.

Advanced 2 min Unlock →

⚙️Batching & Automation 🔒 Pro

Compile a Repeated Skill's Stable Steps Into Code, Keep the Model Only for Judgment

Take a working natural-language agent skill, collect its execution traces, and permanently move the steps that have stabilized into a fixed procedure (search, filter, dedupe, format) into plain code, reserving model calls only for the genuinely semantic steps.

Advanced 2 min Unlock →

💻Coding Assistants 🔒 Pro

Turn Off Auto Codebase Context When You Don't Need It

Features like 'codebase' auto-retrieval and automatic open-file context silently attach extra tokens to every message. Switch them off for narrow tasks and attach context explicitly.

Advanced 2 min Unlock →

🧠Context Management 🔒 Pro

Annotate Context as Dependency-Linked Episodes, Then Evict Graduated

Both at-budget and fixed-interval compaction throw away context blindly. Instead have the agent tag its trajectory as typed episodes with explicit dependency links, then deterministically evict whole episodes only when nothing still-live depends on them — a graduated lifecycle that stays in budget over arbitrarily long runs.

Advanced 3 min Unlock →

📊Measurement & Budgeting 🔒 Pro

Wire Up Spend Alerts and a Token Circuit Breaker Before You Need Them

Combine provider budget alerts with an in-app token meter that hard-stops a feature once it blows past its expected per-period budget.

Advanced 2 min Unlock →

🎚️Model Selection 🔒 Pro

Pin effort on the Managed Agents agent resource, not the session

In the Managed Agents API, an `effort` level set inside a per-session model override is silently ignored, and changing an agent's model `id` silently resets a pinned effort to the new model's default. Set effort on the agent resource, and re-assert it on every model swap.

Advanced 5 min Unlock →

📐Output Control 🔒 Pro

Design a Compact Output Schema (and Skip the Pretty-Printing)

When you need structured data, the shape you ask for directly determines token count. Short keys, no markdown scaffolding, and minified output cut tokens on every response — and the input echo if you loop.

Advanced 1 min Unlock →

♻️Prompt Caching & Reuse 🔒 Pro

Content-Hash Your Wire Blocks to Find the Exact One That Breaks the Cache

When your cache silently stops hitting, the usage block proves it broke but not WHERE. Hash every block you put on the wire and diff turn-to-turn: the single MODIFIED block above your breakpoint is the byte that cost you the cache.

Advanced 3 min Unlock →

✍️Prompt Engineering 🔒 Pro

Try Zero-Shot Before You Pay for Examples

Examples are recurring input tokens on every call — test whether a crisp instruction does the job before you attach them by default.

Advanced 2 min Unlock →

🔎Retrieval & RAG 🔒 Pro

Go Hybrid to Retrieve Less

Combine a keyword (BM25) score with the vector score so exact-term matches rank first, letting you lower top-k because the right chunk lands near the top instead of being padded around with semantic near-misses.

Advanced 2 min Unlock →

⚙️Batching & Automation 🔒 Pro

Collapse Many Tiny Calls into One Structured Request

Ten one-item calls re-send your instructions ten times. Batch the items into a single request with a structured-output schema and pay the overhead once.

Advanced 2 min Unlock →

💻Coding Assistants 🔒 Pro

Estimate, Execute, Expand: Read the Minimum, Widen Only on Failure

Most coding agents front-load context — open a dozen files, grep the whole repo, pull broad retrieval — before making a one-line edit, then re-bill all of it every turn. Estimate the task's real scope, execute the minimal path, and only widen context when a verification step actually fails.

Advanced 4 min Unlock →

🧠Context Management 🔒 Pro

Let the API compact the conversation server-side (compact-2026-01-12)

On long agentic Claude runs, every turn re-bills the entire growing history. Anthropic's server-side compaction auto-summarizes the conversation into a compaction block at a token trigger, then drops all blocks before it on the next request so you pay for the summary plus recent turns instead of the full transcript.

Advanced 3 min Unlock →

📊Measurement & Budgeting 🔒 Pro

Give Your Agent Fleet a Real Dollar Ceiling: --max-budget-usd Now Halts Background Subagents

Before Claude Code 2.1.216, hitting --max-budget-usd denied new subagent spawns but let already-running background subagents keep burning past the cap. 2.1.216 halts them at the ceiling and adds CLAUDE_CODE_MAX_CONCURRENT_SUBAGENTS (default 20) to bound how many contexts draw tokens in parallel. Upgrade and set both to give a long fan-out a true spend ceiling plus a parallelism throttle.

Advanced 3 min Unlock →

🎚️Model Selection 🔒 Pro

Use a Big Model as the Planner, Small Models as the Workers

In agentic and multi-step pipelines, reserve the frontier model for orchestration and hard reasoning, and delegate bulk subtasks (search, read, extract) to a cheaper model.

Advanced 2 min Unlock →

📐Output Control 🔒 Pro

Drop Search-Result Blocks the Agent Already Digested in Code

When Claude uses dynamic filtering to process web-search results in code, set response_inclusion: excluded so the raw search blocks it already consumed are dropped from the API response instead of echoed back as output tokens.

Advanced 2 min Unlock →

♻️Prompt Caching & Reuse 🔒 Pro

On GPT-5.6, Caching Is No Longer Free: Place the Breakpoint Yourself and Name the Key

GPT-5.6 charges 1.25x to write a cache entry — the first generation where caching a prefix can cost more than not caching it. Explicit breakpoints keep per-request text out of the write, and prompt_cache_key keeps matching reliable.

Advanced 4 min Unlock →

🔎Retrieval & RAG 🔒 Pro

Cut Agent-Memory Tokens With Single-Pass Writes and Multi-Signal Recall

The common memory pipeline burns three LLM calls to store one fact and dumps a fat slice of the store on read. Collapse writes to a single ADD-only call, store agent inferences as first-class memories, and recall with fused multi-signal scoring that returns only what the turn needs.

Advanced 2 min Unlock →

⚙️Batching & Automation 🔒 Pro

Record a Successful Agent Run Once, Then Replay the Tool Trace With Zero Inference

A scheduled agent re-plans the same fixed workflow every run, paying the model to re-decide a tool sequence it already decided. Record the trace of one successful run, template the params that vary, and replay the steps deterministically without calling the LLM.

Advanced 2 min Unlock →

💻Coding Assistants 🔒 Pro

Anchor edits with a content hash, not the echoed old string

Search/replace edit tools force the model to re-type the exact original code as a match anchor. Swap that echoed 'old string' for a short content hash the read step already handed back, and the model generates far fewer output tokens per edit — the expensive side of the bill.

Advanced 3 min Unlock →

🧠Context Management 🔒 Pro

Evict Stale Thinking Blocks on Long Reasoning Agents Without Torching Your Cache

On Opus 4.5+/Sonnet 4.6+, extended-thinking blocks from every prior turn are kept in context by default — dead reasoning you re-pay for on every turn of a long agent loop. The clear_thinking strategy evicts them; batch the clears so the cache re-write pays off.

Advanced 3 min Unlock →

📊Measurement & Budgeting 🔒 Pro

Score Agent Runs in Effective Tokens, So Your "20% Cut" Is a Real 20%

Raw token counts treat a cheap cache-read token and an expensive flagship output token as equal — across token classes and tiers they can differ by 50x or more in price. Roll every run up into one price-weighted Effective Token number so a reported X% reduction maps to a real X% spend cut.

Advanced 4 min Unlock →

♻️Prompt Caching & Reuse 🔒 Pro

Steer a running agent with a mid-conversation system message, not a rewritten system prompt

On Claude Opus 4.8 you can append a role:"system" message to the messages array to change instructions, permissions, or budgets mid-run. It sits after the cached history, so the whole prefix still hits the cache instead of being re-billed at full input price.

Advanced 3 min Unlock →

⚙️Batching & Automation 🔒 Pro

Run the open-ended search phase last, over a frozen list of claims

Putting the unbounded research step first lets agents fan out over open questions and mint work faster than the pipeline can retire it — one team burned a full monthly allowance in 30 minutes and shipped nothing. Move it last, after a bounded step has produced a finite claim list, so your agent count is known before you spend rather than discovered when the budget dies.

Advanced 5 min Unlock →

💻Coding Assistants 🔒 Pro

Minify Source Code Before It Enters the Agent's Context

In a state-in-context coding agent, the source files pasted into every turn are the single biggest token contributor. Strip whitespace, comments, and needless formatting from that payload before it reaches the model and you cut input tokens dramatically — but only if you accept a measurable accuracy tradeoff and restore names on the way out.

Advanced 3 min Unlock →

🧠Context Management 🔒 Pro

Compress Tool Output Only If the Agent Would Take the Same Next Action

Shrinking tool observations is only safe if the agent behaves identically on the smaller version. Accept a compressed observation only when it induces the exact same next action as the raw one, and reject it otherwise.

Advanced 3 min Unlock →

📊Measurement & Budgeting 🔒 Pro

Optimize the Bill, Not the Token Count — Caching Can Make Compression Backfire

With prompt caching on, a stable prefix bills at a fraction of base input, so the tokens you change cost far more than the tokens you keep. Trimming context can raise the actual bill and tank task success — measure success-adjusted billed cost instead of raw token count.

Advanced 3 min Unlock →

♻️Prompt Caching & Reuse 🔒 Pro

Rewrite Volatile Prefix Fields to Placeholders So Cold Starts Hit the Cache

A fresh timestamp, request ID, or path at the top of your prefix is never byte-identical twice, so every new session misses the cache. Rewrite those volatile fields to stable placeholders and restore them out of band, turning cold pre-fills into cache hits.

Advanced 3 min Unlock →

⚙️Batching & Automation 🔒 Pro

Make "Nothing To Do" Cost Zero Tokens: Terminal Dispositions and Idempotent Wake IDs

A cron-woken agent that loads its whole context before discovering there is no work pays full price to conclude nothing happened — and if the run never records a terminal outcome, the scheduler wakes it again. Answer both questions deterministically, before any model call.

Advanced 6 min Unlock →

💻Coding Assistants 🔒 Pro

Give Agents Predicate-Flag CLIs and Pre-Filtered Command Output

Coding agents burn tokens reading verbose --help text and giant JSON dumps, then burn more composing jq/wc/python pipelines to interpret them — and under a stateless API the dump is re-billed on every turn. Shape command output into a compact, answer-shaped form before it ever reaches the context window.

Advanced 3 min Unlock →

🧠Context Management 🔒 Pro

Update Codex CLI So the Skill Catalog Budgets Itself Instead of Eating Your Window

Every skill and plugin you install adds a metadata row — name, description, and an absolute host path — that Codex concatenates into context on every turn, whether or not the skill ever fires. As the library grows, that catalog silently eats the model's window on long runs. Codex CLI 0.146.0 (July 29 2026) treats the catalog as a budgeted resource: it compacts host skill paths under metadata pressure and warns when the catalog exceeds its context budget. Update, then heed the warning and tune your loaded set instead of installing every skill wholesale.

Advanced 4 min Unlock →

📊Measurement & Budgeting 🔒 Pro

Cap Each Agent's Spend and Degrade the Model at the Soft Limit

Autonomous agents fan out into tool calls, sub-agents, and retries, so after-the-fact dashboards never actually bound spend. Put a pre-call gate in front of every request that enforces a per-agent hard ceiling and quietly drops to a cheaper model at a soft threshold instead of failing hard.

Advanced 3 min Unlock →

♻️Prompt Caching & Reuse 🔒 Pro

Match TTL to Traffic, and Pre-Warm Before the First User Hits

The default 5-minute cache evaporates between bursts. Choose 5-minute vs 1-hour TTL by your traffic gaps, and pre-warm at startup to kill first-request latency.

Advanced 1 min Unlock →

⚙️Batching & Automation 🔒 Pro

Wake the Agent on a File Cursor, Not on a Heartbeat

Anything that isn't a directly pushed message — cron output, log lines, inbox changes, external feeds — is only noticed at heartbeat granularity, and every heartbeat burns a full LLM turn even when nothing happened. Hold byte-offset cursors on the plaintext files those sources already write, wake the agent only on an actual append, and coalesce bursts into one turn. A polling loop's cost scales with wall-clock time; an event loop's cost scales with real events.

Advanced 7 min Unlock →

💻Coding Assistants 🔒 Pro

Quarantine Heavy Reads in a Sub-Agent, Return Only the Summary

Delegate token-heavy exploration — large-file reads, log digging, multi-file research — to a sub-agent with its own throwaway context window, so only a distilled summary lands in the main thread instead of the entire exploration trail you'd otherwise re-bill every turn.

Advanced 3 min Unlock →

🧠Context Management 🔒 Pro

Put a Content-Aware Compression Proxy Between Your Agent and the Model

Drop a compression layer on the agent-to-model boundary that detects what each blob is (JSON, code, prose) and routes it to a format-specialized lossy compressor, while handing the model a tool to restore any blob to the original on demand. It shapes input tokens before they hit the window, uniformly, regardless of which framework produced them.

Advanced 2 min Unlock →

📊Measurement & Budgeting 🔒 Pro

Run Every Agent Thread Off One Shared Token Ledger That Aborts at the Budget Line

Per-agent caps still let a multi-agent run overspend in aggregate: ten threads each "under budget" blow past the total, and a sub-agent or compaction call drawing context after the work is done is invisible to any single counter. Settle every thread, sub-agent, and compaction call against ONE shared ledger, abort each turn at its next usage-accounting boundary when the ledger hits zero, and feed the remaining balance back into the model's context so it steers its own spend.

Advanced 4 min Unlock →

♻️Prompt Caching & Reuse 🔒 Pro

Read the Usage Block to Prove Your Cache Actually Hits

A cache_control marker that silently never hits looks identical to one that works — until you read the three usage token fields and compute your real hit rate.

Advanced 1 min Unlock →

💻Coding Assistants 🔒 Pro

Pipe Logs Through head/grep Before Pasting Them Into an Assistant

A 4,000-line stack trace or verbose build log is mostly repetition the model doesn't need. Extract the signal first; pasting the whole thing is the single most wasteful coding-assistant habit.

Advanced 2 min Unlock →

🧠Context Management 🔒 Pro

Delta-Debug Your Skill Definitions Down to What Actually Fires

Hand-written Skill / AGENTS.md files carry two kinds of fat: bloated routing descriptions that load on every turn, and bodies stuffed with background and examples the model rarely needs. Run an automated pass that delta-debugs each routing line to the minimum that still triggers correctly, then splits the body into core rules vs on-demand reference files.

Advanced 4 min Unlock →

♻️Prompt Caching & Reuse 🔒 Pro

Stop Paying Twice for Your Cache When a Refusal Forces a Model Switch

Prompt caches are per-model, so retrying a refused request on a fallback model normally re-writes the whole cached prefix at the write rate. Anthropic's fallback credit re-bills it as a cache read instead.

Advanced 2 min Unlock →

💻Coding Assistants 🔒 Pro

Tee the Agent's Own Reads to a Cheap Live-Memory Model

Instead of letting your premium coding agent re-read the same files into its own context over and over, tee the reads it already performs to an always-on cheap model that distills them into a queryable memory ledger. The expensive model then asks one question instead of re-opening files.

Advanced 3 min Unlock →

🧠Context Management 🔒 Pro

Keep Working State in a File, Not in the Conversation

On long tasks, let the model write its plan, findings, and decisions to an external scratchpad and re-read only the slice it needs, instead of accumulating all of it as ever-growing conversation history.

Advanced 2 min Unlock →

♻️Prompt Caching & Reuse 🔒 Pro

Reanchor the Cache in Long Agent Turns to Beat the 20-Block Lookback

In long agentic runs, accumulating tool blocks push your cache breakpoint out of Anthropic's 20-block lookback window, so caching silently stops firing. Add a second, trailing breakpoint to keep it alive.

Advanced 2 min Unlock →

💻Coding Assistants 🔒 Pro

Tier Sub-Agent Models by Depth: Opus at the Root, Haiku at the Leaves

Claude Code sub-agents can now spawn their own sub-agents up to 5 levels deep. If every frame inherits your flagship model, a deep tree multiplies Opus-priced tokens. Make the model cheaper as depth increases.

Advanced 2 min Unlock →

🧠Context Management 🔒 Pro

Fork the Session Instead of Re-Spawning a Sub-Agent That Starts From Zero

On a multi-step agent job that works over one shared codebase, spawning a fresh sub-agent per step makes each one re-find, re-search, and re-read everything from scratch at full input price. When a phase builds on the last, fork the session so it inherits that context and skips the redundant re-exploration.

Advanced 4 min Unlock →

♻️Prompt Caching & Reuse 🔒 Pro

Add a Semantic Cache to Skip the Call on Near-Duplicate Queries

Prompt caching only discounts the repeated prefix of an exact request — you still pay for the call. A semantic cache embeds each incoming query and returns a stored answer when a past query is close enough, eliminating the LLM call entirely for paraphrased duplicates.

Advanced 2 min Unlock →

🧠Context Management 🔒 Pro

Let the agent decide when to compact, not a token threshold

Threshold-triggered compaction fires mid-derivation and throws away partial results you already paid to compute. Give the agent a compaction tool plus a rubric so it compacts on sub-task completion instead of at an arbitrary token line.

Advanced 4 min Unlock →

♻️Prompt Caching & Reuse 🔒 Pro

Swap an Agent's Toolset Mid-Conversation Without Busting the Cache

A new Anthropic beta lets an agent add or remove tools between turns without invalidating the cached prefix, so phase-switching agents stop paying a full cold read every time the toolset changes.

Advanced 2 min Unlock →

🧠Context Management 🔒 Pro

Compress MCP tool schemas with a lazy-loading proxy

When you wire up several MCP servers, the full schema for every tool they expose rides along on every turn — a fixed tax that dwarfs your actual prompt. Put a compression proxy in front that exposes just two tools and lets the model fetch a tool's full schema only when it actually needs it.

Advanced 4 min Unlock →

🧠Context Management 🔒 Pro

Turn On OpenAI's Server-Side Compaction for Hour-Long Agent Runs

A multi-hour OpenAI agent re-sends its entire swelling transcript on every turn. Set a compaction threshold and the Responses API summarizes old turns server-side into one encrypted item that carries state forward in far fewer tokens.

Advanced 2 min Unlock →

🧠Context Management 🔒 Pro

Render Bulky Static Context as an Image and Pay the Per-Image Token Cap

Text input is billed per token with no ceiling; an image is billed by pixel area and hits a hard per-image cap no matter how many characters it depicts. For genuinely huge, static, lossy-tolerant blocks, re-rendering them as a dense image changes the encoding you're billed under without removing anything from the model's view.

Advanced 5 min Unlock →

🧠Context Management 🔒 Pro

Set reasoning.context to all_turns So Your Agent Stops Re-Deriving Chains of Thought You Already Bought

On the Responses API, whether earlier turns' reasoning items get rendered into the next sample is the model's default unless you say otherwise — so a multi-turn agent can silently re-derive conclusions it already reached and billed you for. `reasoning.context: "all_turns"` retains them. Switch back to `current_turn` the moment that reasoning goes stale.

Advanced 4 min Unlock →

🧠Context Management 🔒 Pro

Rebuild the prompt fresh each turn, don't append the transcript

In a long agent loop, appending every observation and action to a running transcript makes each turn cost more than the last. Instead, assemble a fresh, fixed-size user message per decision by typed retrieval from an external store, so prompt size stays constant no matter how long the run gets.

Advanced 2 min Unlock →

🧠Context Management 🔒 Pro

Query Huge Inputs as a Code Variable, Not a Prompt (Recursive Language Models)

Instead of pasting a giant corpus into the context window, load it as a variable in a code sandbox and let the model write code to chunk it and fire cheap recursive LM calls over the pieces — so the root model orchestrates the answer without ever ingesting the full text.

Advanced 3 min Unlock →

🧠Context Management 🔒 Pro

Retrieve Tools by Relevance Each Turn Instead of Loading the Whole Catalog

Keep the full tool/skill catalog out of the context window in a local index and, each turn, use keyword + semantic retrieval to inject only the handful of tool schemas relevant to the current step.

Advanced 2 min Unlock →

🧠Context Management 🔒 Pro

Self-GC: a side-channel planner that folds, masks, or prunes each context object

Most context trimming is a single binary rule: keep the object or drop it. Self-GC turns every user turn and tool span into an indexed object, then runs a separate planner that picks per object among three graded actions - fold (move the exact payload to a sidecar and leave a recovery pointer), mask (keep the structural boundaries but elide the low-signal middle), or prune (drop it, with no recovery guarantee) - so most trimming stays recoverable and only genuine dead-ends are dropped for good.

Advanced 5 min Unlock →

🧠Context Management 🔒 Pro

Let the API Auto-Clear Stale Tool Results Mid-Run

In long agent loops, every old file dump and search result gets re-billed as input on each new turn. Anthropic's context-editing beta makes the API itself swap stale tool results for short placeholders once you cross a token trigger, so you stop re-paying for context the agent no longer needs.

Advanced 3 min Unlock →

🧠Context Management 🔒 Pro

Build a Sliding Window with a Cached, Stable Prefix

For API apps, cap history at the last N turns and put unchanging instructions first so prompt caching can discount the prefix.

Advanced 2 min Unlock →

🧠Context Management 🔒 Pro

Store Each Inter-Agent Message Once, Then Reference It

In multi-agent orchestrator loops, a single inter-agent message body can get serialized into two places at once: the replayed conversation history AND the tool-result blocks. You then re-pay input tokens for the same payload twice on every replayed turn, and the waste compounds as the thread grows. Store each message once and reference it by ID so the duplicate disappears — a structural fix, not an after-the-fact prune (shipped in Claude Code 2.1.212).

Advanced 4 min Unlock →

🧠Context Management 🔒 Pro

Strip Credentials from Tool Schemas and Inject Them at Execution Time

Remove oauth_token, api_key, and client_id parameters from every tool's JSON schema and resolve them at call time from a vault keyed on connected_account_id, so auth fields never occupy your context window on any turn.

Advanced 3 min Unlock →

🧠Context Management 🔒 Pro

Prune tool output by relevance, not by age

A sliding window evicts old tool results and keeps recent ones — but "recent" and "relevant" are different things. Score each tool result against the CURRENT task with a small learned scorer and drop the low-scoring blobs regardless of age, keeping the one critical result from twelve turns ago that a window would have thrown away.

Advanced 3 min Unlock →

🧠Context Management 🔒 Pro

Validate the Summary Before You Drop the Old Context, Not After

Naive summarize-and-drop compaction saves tokens right up until the summary silently loses a load-bearing fact and the agent falls off an accuracy cliff. Add a validation gate: before evicting the old turns, verify the fresh summary actually preserves the must-keep facts. Only drop once the check passes.

Advanced 4 min Unlock →

128 ways to spend fewer tokens

Move Every Non-Urgent Job to the Batch API and Pay Half Price

Ask for the Patch, Not the Whole File

Drop the Screenshot Once the Model Has Read It

Count Tokens Before You Hit Send, Not After the Bill Arrives

Stop Paying Frontier Prices for Boilerplate Work

Return IDs and Enums, Not Sentences

Freeze the Prefix: One Stray Timestamp Kills Your Whole Cache

Stop Paying for 'Please' and 'I Was Wondering If'

Cache the Context, Not Just the Answer

Classify a Whole List in One Call, Not One Row at a Time

Run /clear Between Tasks in Claude Code Instead of Letting Context Pile Up

Paste the Function, Not the Whole File

Set Hard Spend Caps in the Provider Console

Strip the Preamble: Ask for the Answer Only

Share One Cached System Prompt Across All Your Users

Fence the Output Before It Wanders

Stop Pasting Whole Documents: Retrieve the 3 Chunks That Actually Answer the Question

Deduplicate and Cache Identical Requests Before They Ever Hit the API

Add a .cursorignore So Cursor Stops Indexing Your node_modules

Start a New Chat When the Topic Changes

Set max_tokens as a Hard Cost Ceiling, Not an Afterthought

Emit CSV, Not Markdown Tables

Fetch the Data the Agent Always Needs Before the Loop Starts

Return only changed lines when an agent re-reads a file

Run /context and Strip MCP Tool Schemas You Never Call

Cap the loops by count, not tokens: spawn and search limits in Claude Code

Buy More Reasoning on the Cheap Model Before You Upgrade the Tier

Use Stop Sequences to Cut Generation the Instant You Have Enough

Cache Your Tool Definitions, Not Just the System Prompt

Ask for the Diff, Not the Director's Cut

Chunk on Structure, Not Character Count, So You Retrieve Fewer (and Smaller) Chunks

Add a Relevance Gate: The Cheapest LLM Call Is the One You Don't Make

Scope Copilot Chat With #file Instead of @workspace

Compact Context by Deleting Dead Weight, Not Summarizing It

Log input_tokens and output_tokens on Every Call to Find Your Real Waste

Cascade: Try the Cheap Model First, Escalate Only When It Fails

Abort the Stream the Moment You Have Enough

Carry Gemini Conversation State Server-Side Instead of Resending the Transcript

Two Sharp Examples Beat Eight Bloated Ones

Filter by Metadata Before You Search

Use Targeted Retries with Backoff Instead of Blindly Re-Sending

Use Inline Completion for Trivial Edits, Save Chat for Reasoning

Convert Fetched Pages to Markdown Before They Hit the Agent's Context

Give Each Feature a Token Budget and Enforce It with max_tokens

Set service_tier flex for Batch Prices on the Sync Endpoint

Serialize Uniform Tables as TOON, Not JSON

Know Your Minimum: Short Prompts Silently Refuse to Cache

Reference Your Data, Don't Re-Paste It Every Turn

Add a Reranker and a Hard Token Budget: Retrieve 20 Candidates, Send Only the Best 3

Stack Prompt Caching on Top of Your Batch Jobs for Compounding Savings

Load Skill Names Up Front, Bodies Only When Triggered

Keep agent threads under the long-context price cliff

Don't Burn Reasoning Tokens on Tasks That Don't Reason

Order Your Prompt by Volatility: Tools, then System, then the Question

Prune Bulky Tool Results Once You've Used Them

Tag Every Call So You Know Which Feature Burns Tokens

Compress Long Threads with a Rolling Summary

Run an Async Queue with a Concurrency Cap Instead of Firing All at Once

Call Tools as Code, Not by Injecting Every Tool Definition

Abort the Agent the Moment It Stops Making Progress

Measure Cost-Per-Success, Not Cost-Per-Call

The Harness Effect: Fix Your Orchestration Before You Reach for a Cheaper Model

Cap runaway agents with Anthropic task budgets, not just max_tokens

Compact When the Cache Is Already Cold, Not When the Transcript Gets Big

Factor Your System Prompt and Cap the Output

Embed and Summarize Once: Stop Re-Tokenizing the Same Documents on Every Query

Compile a Repeated Skill's Stable Steps Into Code, Keep the Model Only for Judgment

Turn Off Auto Codebase Context When You Don't Need It

Annotate Context as Dependency-Linked Episodes, Then Evict Graduated

Wire Up Spend Alerts and a Token Circuit Breaker Before You Need Them

Pin effort on the Managed Agents agent resource, not the session

Design a Compact Output Schema (and Skip the Pretty-Printing)

Content-Hash Your Wire Blocks to Find the Exact One That Breaks the Cache

Try Zero-Shot Before You Pay for Examples

Go Hybrid to Retrieve Less

Collapse Many Tiny Calls into One Structured Request

Estimate, Execute, Expand: Read the Minimum, Widen Only on Failure

Let the API compact the conversation server-side (compact-2026-01-12)

Give Your Agent Fleet a Real Dollar Ceiling: --max-budget-usd Now Halts Background Subagents