00 · Token use
Same agent. Same model.
A fraction of the bill.
Long-running tool-using agents accumulate verbose outputs in their transcripts — shell logs, file reads, test results, browser traces, intermediate plans. By default these are appended directly to the conversation, producing context pressure that drives cost and latency up without proportionally improving decisions. Espada attacks this at the orchestration layer, between the agent and the model, with eight composable, model-agnostic reducers. Most are agent-transparent: a normal tool call yields a normal-looking result; the runtime decides how it persists and how it is resent on later turns.
-
81–93%
Lower cumulative tokens
Across long, tool-heavy agent sessions.
-
80–99%
Per-artifact reduction
On verification logs, workflow envelopes, historical results.
-
$10 → ~$1
Typical task cost
Same task, same model, same agent — Espada in the middle.
The eight reducers
- 01
Skill routing
Inject only the top-ranked skills for the active prompt. Ambiguous prompt? Keep the full set — capability never silently disappears.
- 02
Tool-result shaping
Rewrite workflow_run JSON envelopes into a dense text shape. Keys appear once per envelope, not once per operation.
- 03
Verification compression
Detect test, lint, and build logs. Strip routine progress on pass; preserve failure signal verbatim — assertions, errors, expected vs received.
- 04
Transcript virtualization
Old, large tool results are swapped for compact stubs that point back to the original. The freshest one or two turns stay full-fidelity.
- 05
Lossless raw refs
When a result is shaped, the raw text is persisted synchronously to an append-only sidecar. Nothing is dropped — only deferred.
- 06
Targeted retrieval
context_ref_get resolves a saved reference three ways: a default read, a line range, or a substring query. Bounded reads, no surprise blow-ups.
- 07
Run-state store
A session-scoped key-value tool. The agent stops pasting plans, intermediate JSON, and partial results back into the transcript.
- 08
Argument compaction
Even the arguments to a run_state set call can be huge. The runtime compacts the on-wire value so the persisted transcript stays small.
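To make reducers 04 to 06 concrete, here is a minimal sketch of how stubbing, raw-ref persistence, and bounded retrieval could fit together. It is not Espada's implementation. The `Sidecar` class, the `virtualize` function, the `ref:N` format, and the `keep_recent`, `min_chars`, and `max_lines` parameters are all hypothetical names chosen for illustration, and each list element stands in for one turn's tool result.

```python
from dataclasses import dataclass, field


@dataclass
class Sidecar:
    """Hypothetical append-only store for raw tool output (reducer 05)."""
    blobs: list = field(default_factory=list)

    def put(self, raw: str) -> str:
        # Append only: existing entries are never rewritten or deleted.
        self.blobs.append(raw)
        return f"ref:{len(self.blobs) - 1}"

    def get(self, ref, start=None, end=None, query=None, max_lines=20):
        """Bounded read (reducer 06): substring query, line range, or default."""
        lines = self.blobs[int(ref.split(":")[1])].splitlines()
        if query is not None:
            hits = [ln for ln in lines if query in ln]
        elif start is not None:
            hits = lines[start - 1:end]  # 1-indexed, inclusive range
        else:
            hits = lines
        # Every mode is capped, so a read can never return the whole blob
        # by accident.
        return "\n".join(hits[:max_lines])


def virtualize(results, sidecar, keep_recent=2, min_chars=200):
    """Swap old, large results for stubs; keep the freshest ones intact (04)."""
    cutoff = len(results) - keep_recent
    shaped = []
    for i, text in enumerate(results):
        if i < cutoff and len(text) >= min_chars:
            ref = sidecar.put(text)  # persist the raw text before replacing it
            shaped.append(f"[stub {ref}: {len(text.splitlines())} lines]")
        else:
            shaped.append(text)
    return shaped


sidecar = Sidecar()
log = "\n".join(f"test_{n} PASSED" for n in range(50))
log += "\ntest_50 FAILED: expected 3, received 4"
history = [log, "short result", "recent A", "recent B"]

shaped = virtualize(history, sidecar)
print(shaped[0])                              # compact stub instead of the log
print(sidecar.get("ref:0", query="FAILED"))   # failure line, recovered on demand
```

The design point is ordering: the raw text is written to the sidecar before the stub replaces it, so the transcript shrinks without anything becoming unrecoverable.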
How it composes
Espada's reducers sit at the runtime/orchestration layer. They compose with everything else you're already doing to keep context cost down — they don't replace it.
-
Provider prompt caching
Anthropic and OpenAI prompt caching reduces unit cost of unchanged prefixes by up to ~90%. Espada attacks volume. The two compose: shorter shaped tail content preserves more of the cacheable-prefix budget. Historical virtualization runs at indices already outside the typical cache window.
-
Contextual retrieval
Anthropic Contextual Retrieval operates over an external knowledge base. Espada's reducers operate over the agent's own past tool outputs within a single session — a different object. A deployment can use both at the same time.
-
Prompt-token compression
LLMLingua and Selective-Context are provider- and task-agnostic by design. They can't exploit the structure of a workflow_run envelope or distinguish "0 failed" from "failed". Espada's structural shapers can. Token-layer compression can still be applied on top of shaped output.
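The claim that caching and volume reduction compose can be checked with simple arithmetic. The sketch below uses illustrative numbers only, not measurements: a hypothetical turn with a 20,000-token cacheable prefix and an 80,000-token tail, a 90% cache discount, and an assumed 85% reduction of the tail from shaping. It also ignores cache-write surcharges and output tokens, which real provider pricing includes.

```python
def billed_input(prefix_tokens: int, tail_tokens: int,
                 cache_discount_pct: int = 0) -> float:
    """Billed input for one turn, in full-price token equivalents.

    Only the prefix benefits from the cache discount; the tail is
    always billed at full price.
    """
    cached_prefix = prefix_tokens * (100 - cache_discount_pct) / 100
    return cached_prefix + tail_tokens


PREFIX, TAIL = 20_000, 80_000      # hypothetical turn, not a measurement
SHAPED_TAIL = TAIL * 15 // 100     # assumes shaping removes 85% of the tail

baseline = billed_input(PREFIX, TAIL)
caching_only = billed_input(PREFIX, TAIL, cache_discount_pct=90)
shaping_only = billed_input(PREFIX, SHAPED_TAIL)
both = billed_input(PREFIX, SHAPED_TAIL, cache_discount_pct=90)

for name, value in [("baseline", baseline), ("caching only", caching_only),
                    ("shaping only", shaping_only), ("both", both)]:
    print(f"{name:>13}: {value:>9,.0f}")
```

Under these assumptions, caching alone barely moves the total because the uncached tail dominates. Shrinking the tail is what lets the cache discount show up in the bill.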
All eight reducers can be disabled independently for ablation. Per-fixture studies, the full loss model, and the reproducibility protocol are in the research paper. The protocol covers the corpus, baselines (LLMLingua-2, Selective-Context, naive truncation, single-LLM summary), provider tokenizers, and task-quality benchmarks.
Read the full design
Ready when you are.
One binary. One install. One hour to your first signed action.