The library

128 ways to spend fewer tokens

The 22 Beginner tips are free to read. The 106 advanced tactics unlock with Pro — plus a fresh tip in your inbox every morning.

All ⚙️ Batching & Automation (14) 💻 Coding Assistants (16) 🧠 Context Management (34) 📊 Measurement & Budgeting (14) 🎚️ Model Selection (8) 📐 Output Control (10) ♻️ Prompt Caching & Reuse (17) ✍️ Prompt Engineering (7) 🔎 Retrieval & RAG (8)

🔎 Retrieval & RAG Eliminates the embedding API call and vector search on cache hits. Savings track your hit rate and come out of embedding/retrieval spend, not the tokens sent to the LLM.

Cache the Context, Not Just the Answer

Cache the retrieved chunk set keyed by a normalized query, so popular or repeated questions skip the embedding call and vector search and reuse the same context block instead of rebuilding it every time.

Beginner 2 min Read →
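A minimal sketch of the idea, assuming your existing embed-and-search step can be wrapped in one function. The names `normalize_query`, `ChunkCache`, and `retrieve` are illustrative and not taken from any particular library, and the in-memory dict stands in for whatever store you actually use.

```python
import hashlib
import re
from typing import Callable, Dict, List


def normalize_query(query: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", query.lower())
    return " ".join(text.split())


class ChunkCache:
    """Caches the retrieved chunk set per normalized query (hypothetical helper)."""

    def __init__(self, retrieve: Callable[[str], List[str]]) -> None:
        # `retrieve` is assumed to be the expensive embed + vector-search step.
        self._retrieve = retrieve
        self._store: Dict[str, List[str]] = {}
        self.hits = 0
        self.misses = 0

    def get_chunks(self, query: str) -> List[str]:
        key = hashlib.sha256(normalize_query(query).encode("utf-8")).hexdigest()
        if key in self._store:
            self.hits += 1  # cache hit: no embedding call, no vector search
            return self._store[key]
        self.misses += 1
        chunks = self._retrieve(query)
        self._store[key] = chunks
        return chunks
```

The `hits` and `misses` counters let you check the hit rate, which is what the savings depend on. A real deployment would likely also want an expiry policy so cached chunk sets do not outlive the documents they came from.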

🔎 Retrieval & RAG Saves 70-95% on document-heavy prompts

Stop Pasting Whole Documents: Retrieve the 3 Chunks That Actually Answer the Question

Dumping a full PDF or knowledge base into every prompt bills you for thousands of tokens the model never needed. Retrieve only the passages relevant to the question instead.

Beginner 2 min Read →

🔎 Retrieval & RAG 🔒 Pro

Chunk on Structure, Not Character Count, So You Retrieve Fewer (and Smaller) Chunks

Naive fixed-length chunking splits ideas mid-sentence, forcing you to retrieve more chunks (and more overlap) to capture one answer. Chunk on semantic boundaries to send fewer tokens per query.

Intermediate 2 min Unlock →

🔎 Retrieval & RAG 🔒 Pro

Filter by Metadata Before You Search

Attach structured metadata to chunks and apply WHERE-style filters before the vector search runs, so the search scores and ranks a smaller candidate set and fewer off-topic chunks end up in the prompt.

Intermediate 2 min Unlock →
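As a rough illustration of filter-then-search, here is a brute-force in-memory sketch. The `Chunk` dataclass, `filtered_search`, and the `where` dict of exact-match conditions are assumptions made for the example; most vector databases expose their own filter syntax instead.

```python
import math
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Chunk:
    text: str
    embedding: List[float]
    metadata: Dict[str, str] = field(default_factory=dict)


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def filtered_search(
    query_embedding: List[float],
    chunks: List[Chunk],
    where: Dict[str, str],
    top_k: int = 3,
) -> List[Chunk]:
    # Step 1: the WHERE-style filter runs first and shrinks the candidate set.
    candidates = [
        c for c in chunks
        if all(c.metadata.get(key) == value for key, value in where.items())
    ]
    # Step 2: similarity is computed only for chunks that passed the filter.
    candidates.sort(key=lambda c: cosine(query_embedding, c.embedding), reverse=True)
    return candidates[:top_k]
```

Calling `filtered_search(q, chunks, {"product": "billing"})` never scores a shipping chunk, however similar its embedding happens to be.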

🔎 Retrieval & RAG 🔒 Pro

Add a Reranker and a Hard Token Budget: Retrieve 20 Candidates, Send Only the Best 3

Vector similarity is approximate, so people inflate top-k to avoid missing the answer. A cheap reranking pass lets you fetch many candidates but send only the few that matter to the expensive LLM.

Intermediate 2 min Unlock →
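One way the rerank-then-budget step might look, as a sketch only. The `score` callable stands in for whatever reranker you use (a cross-encoder, for instance); `overlap_score` is a toy placeholder, and `count_tokens` approximates tokens by whitespace-separated words, so swap in your model's tokenizer for real budgets.

```python
from typing import Callable, List


def count_tokens(text: str) -> int:
    """Crude token estimate (assumption): one token per whitespace-separated word."""
    return len(text.split())


def overlap_score(query: str, chunk: str) -> float:
    """Toy reranker (assumption): fraction of query words present in the chunk."""
    q = set(query.lower().split())
    c = set(chunk.lower().split())
    return len(q & c) / (len(q) or 1)


def select_context(
    query: str,
    candidates: List[str],
    score: Callable[[str, str], float] = overlap_score,
    max_chunks: int = 3,
    token_budget: int = 1500,
) -> List[str]:
    # Rerank the wide candidate set (e.g. 20 chunks) from best to worst.
    ranked = sorted(candidates, key=lambda c: score(query, c), reverse=True)
    selected: List[str] = []
    used = 0
    for chunk in ranked:
        if len(selected) == max_chunks:
            break
        cost = count_tokens(chunk)
        if used + cost > token_budget:
            continue  # would overflow the budget; a later, smaller chunk may fit
        selected.append(chunk)
        used += cost
    return selected
```

The budget is a hard cap: a chunk that would push the total over `token_budget` is skipped rather than truncated, so the prompt size stays predictable.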

🔎 Retrieval & RAG 🔒 Pro

Embed and Summarize Once: Stop Re-Tokenizing the Same Documents on Every Query

Re-embedding unchanged documents and re-summarizing the same sources on every run quietly burns tokens. Compute these artifacts once, persist them, and rely on provider-side prompt caching for stable context.

Advanced 2 min Unlock →

🔎 Retrieval & RAG 🔒 Pro

Go Hybrid to Retrieve Less

Combine a keyword (BM25) score with the vector score so exact-term matches rank first, letting you lower top-k because the right chunk lands near the top instead of being buried among semantic near-misses.

Advanced 2 min Unlock →
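A sketch of one possible fusion step, assuming the BM25 and vector scores per chunk ID have already been computed elsewhere. Min-max normalization with a weighted sum is just one option (reciprocal rank fusion is another), and `hybrid_rank` and `alpha` are illustrative names rather than any library's API.

```python
from typing import Dict, List


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1] so BM25 and cosine values are comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {key: 0.0 for key in scores}
    return {key: (value - lo) / (hi - lo) for key, value in scores.items()}


def hybrid_rank(
    bm25: Dict[str, float],
    vector: Dict[str, float],
    alpha: float = 0.5,
    top_k: int = 3,
) -> List[str]:
    """Return the top_k chunk IDs by weighted sum; alpha weights the keyword side."""
    kw = min_max(bm25) if bm25 else {}
    vec = min_max(vector) if vector else {}
    ids = set(kw) | set(vec)
    fused = {
        i: alpha * kw.get(i, 0.0) + (1 - alpha) * vec.get(i, 0.0)
        for i in ids
    }
    # Sort by fused score descending; break ties by ID for stable output.
    return sorted(fused, key=lambda i: (-fused[i], i))[:top_k]
```

Raising `alpha` favors exact-term matches, which is what lets a chunk with the literal keyword outrank semantically similar near-misses and makes a smaller `top_k` safe.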

🔎 Retrieval & RAG 🔒 Pro

Cut Agent-Memory Tokens With Single-Pass Writes and Multi-Signal Recall

The common memory pipeline burns three LLM calls to store one fact and dumps a fat slice of the store on read. Collapse writes to a single ADD-only call, store agent inferences as first-class memories, and recall with fused multi-signal scoring that returns only what the turn needs.

Advanced 2 min Unlock →

Like what you see?

Get a fresh one in your inbox — weekly free, daily on Pro.

Subscribe free Go Pro — $9/mo