00 · Token use

Same agent. Same model.
A fraction of the bill.

Long-running tool-using agents accumulate verbose outputs in their transcripts — shell logs, file reads, test results, browser traces, intermediate plans. By default these are appended directly to the conversation, producing context pressure that drives cost and latency up without proportionally improving decisions. Espada attacks this at the orchestration layer, between the agent and the model, with eight composable, model-agnostic reducers. Most are agent-transparent: a normal tool call yields a normal-looking result; the runtime decides how it persists and how it is resent on later turns.

81–93%

Lower cumulative tokens

Across long, tool-heavy agent sessions.
80–99%

Per-artifact reduction

On verification logs, workflow envelopes, historical results.
$10 → ~$1

Typical task cost

Same task, same model, same agent — Espada in the middle.

The eight reducers

01
Skill routing

Inject only the top-ranked skills for the active prompt. Ambiguous prompt? Keep the full set — capability never silently disappears.
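
A minimal sketch of the routing idea, not Espada's implementation. The `Skill` type, the keyword-overlap score, and the `top_k` and `min_score` parameters are all assumptions made for illustration; the only behavior taken from the text is "top-ranked when clear, full set when ambiguous".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Skill:
    name: str
    keywords: frozenset  # hypothetical: terms that signal the skill is relevant


def route_skills(prompt, skills, top_k=2, min_score=1):
    """Return the top_k skills for the prompt, or every skill if ranking is ambiguous."""
    words = set(prompt.lower().split())
    scored = [(len(words & s.keywords), s) for s in skills]
    # Ambiguous: no skill reaches the minimum score, so keep the full set.
    if max((score for score, _ in scored), default=0) < min_score:
        return list(skills)
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [s for score, s in ranked[:top_k] if score >= min_score]
```

The fallback branch is the point: a prompt that matches nothing returns every skill rather than an empty list.
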
02
Tool-result shaping

Rewrite workflow_run JSON envelopes into a dense text shape. Keys appear once per envelope, not once per operation.
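
One way to picture "keys appear once per envelope" is a header row followed by one value row per operation. The sketch below assumes a hypothetical envelope with an `operations` array of flat records; the real workflow_run schema and Espada's actual text shape are not given here.

```python
import json


def shape_envelope(raw_json):
    """Render operation records as one header line plus one row per operation."""
    envelope = json.loads(raw_json)
    operations = envelope["operations"]  # hypothetical field name
    # Collect keys in first-seen order so the header is stable.
    keys = []
    for op in operations:
        for key in op:
            if key not in keys:
                keys.append(key)
    lines = ["|".join(keys)]
    for op in operations:
        # Missing keys become empty cells; values containing "|" would need escaping.
        lines.append("|".join(str(op.get(key, "")) for key in keys))
    return "\n".join(lines)
```

In this layout each key is written once, so the saving grows with the number of operations in the envelope.
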
03
Verification compression

Detect test, lint, and build logs. Strip routine progress on pass; preserve failure signal verbatim — assertions, errors, expected vs received.
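
A rough sketch of the pass/fail split, assuming two hypothetical regular expressions stand in for real log detection. A production detector would be tuned per tool; this only illustrates "drop progress on pass, keep failure lines verbatim".

```python
import re

# Hypothetical failure markers, case-sensitive so "0 failed" is not a failure.
FAILURE = re.compile(r"\b(FAIL(ED)?|Error|AssertionError|Expected|Received)\b")
# Hypothetical summary marker, e.g. "2 passed, 0 failed".
SUMMARY = re.compile(r"\b\d+ (passed|failed)\b")


def compress_log(log):
    """Keep only the summary on pass; keep failure lines and summary on fail."""
    lines = log.splitlines()
    summary = [ln for ln in lines if SUMMARY.search(ln)]
    failing = [ln for ln in lines if FAILURE.search(ln)]
    if not failing:
        # Pass: routine progress is dropped. Unrecognized logs pass through unchanged.
        return "\n".join(summary) if summary else log
    # Failure: keep matching lines exactly as written, in their original order.
    kept = [ln for ln in lines if FAILURE.search(ln) or SUMMARY.search(ln)]
    return "\n".join(kept)
```

A line-based filter like this loses multi-line stack traces that carry no marker, which is one reason to pair it with a raw reference to the full log.
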
04
Transcript virtualization

Old, large tool results are swapped for compact stubs that point back to the original. The freshest one or two turns stay full-fidelity.
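
The swap can be sketched as a pass over the message list. Everything here is an assumption for illustration: the message dict shape, the `save_raw` callback, the stub format, and the use of message positions as a stand-in for turns.

```python
def virtualize(messages, save_raw, keep_recent=2, min_chars=200):
    """Return a copy of messages with old, large tool results replaced by stubs."""
    cutoff = len(messages) - keep_recent
    out = []
    for index, msg in enumerate(messages):
        is_old = index < cutoff
        is_large_tool = msg["role"] == "tool" and len(msg["content"]) >= min_chars
        if is_old and is_large_tool:
            # Persist the original first, then point the stub back at it.
            ref_id = save_raw(msg["content"])
            stub = f"[virtualized ref={ref_id} chars={len(msg['content'])}]"
            out.append({**msg, "content": stub})
        else:
            # Recent or small messages stay full-fidelity.
            out.append(msg)
    return out
```

The input list is left untouched, so the caller decides when the virtualized copy replaces what is sent to the model.
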
05
Lossless raw refs

When a result is shaped, the raw text is persisted synchronously to an append-only sidecar. Nothing is dropped — only deferred.
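
A minimal sketch of a synchronous, append-only sidecar, assuming a JSON Lines file and a content-hash reference id. The file format, the id scheme, and both function names are hypothetical; only the append-then-sync behavior comes from the text.

```python
import hashlib
import json
import os


def persist_raw(sidecar_path, raw_text):
    """Append the raw text to the sidecar and return a reference id."""
    ref_id = hashlib.sha256(raw_text.encode("utf-8")).hexdigest()[:16]
    record = json.dumps({"ref": ref_id, "raw": raw_text})
    # Mode "a" only ever appends; existing records are never rewritten.
    with open(sidecar_path, "a", encoding="utf-8") as sidecar:
        sidecar.write(record + "\n")
        sidecar.flush()
        os.fsync(sidecar.fileno())  # force the write to disk before returning
    return ref_id


def load_raw(sidecar_path, ref_id):
    """Scan the sidecar for a reference id and return the original text."""
    with open(sidecar_path, encoding="utf-8") as sidecar:
        for line in sidecar:
            record = json.loads(line)
            if record["ref"] == ref_id:
                return record["raw"]
    raise KeyError(ref_id)
```

Writing before shaping returns is what makes the reduction deferral rather than loss: the stub never exists without its original on disk.
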
06
Targeted retrieval

context_ref_get resolves a saved reference by default, by line range, or by substring query. Bounded reads, no surprise blow-ups.
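
The three access modes can be sketched as one bounded function. This is not the context_ref_get signature, which is not documented here: `ref_get`, its parameters, the `max_chars` cap, and the assumption that the default mode returns the whole text up to that cap are all hypothetical.

```python
def ref_get(raw_text, line_range=None, query=None, max_chars=2000):
    """Return the whole text, a line range, or matching lines, capped at max_chars."""
    lines = raw_text.splitlines()
    if line_range is not None:
        start, end = line_range  # 1-indexed, inclusive
        selected = lines[start - 1:end]
    elif query is not None:
        selected = [ln for ln in lines if query in ln]
    else:
        selected = lines
    text = "\n".join(selected)
    # Bound every read so a retrieval cannot reinflate the context.
    if len(text) > max_chars:
        text = text[:max_chars] + "\n[truncated]"
    return text
```

Applying the cap after selection means even a broad query or a wide line range has a known worst-case size.
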
07
Run-state store

A session-scoped key-value tool. The agent stops pasting plans, intermediate JSON, and partial results back into the transcript.
08
Argument compaction

Even run_state set arguments can be huge. The runtime compacts the on-wire value so the persisted transcript stays small.
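
The last two reducers work as a pair, sketched below under stated assumptions: `RunState`, `compact_call_for_transcript`, the record fields, and the preview format are hypothetical names, not Espada's API. The store holds the full value; the transcript keeps only a shortened copy of the argument.

```python
class RunState:
    """Session-scoped key-value store; values live here, not in the transcript."""

    def __init__(self):
        self._values = {}

    def set(self, key, value):
        self._values[key] = value
        # The tool result is a short acknowledgement, not an echo of the value.
        return f"stored {key} ({len(value)} chars)"

    def get(self, key):
        return self._values[key]


def compact_call_for_transcript(key, value, preview_chars=40):
    """Build the persisted tool-call record with the value shortened."""
    if len(value) <= preview_chars:
        shown = value
    else:
        shown = value[:preview_chars] + f"... [{len(value)} chars total]"
    return {"tool": "run_state", "action": "set", "key": key, "value": shown}
```

The full plan stays retrievable through `get`, while the record kept in the transcript carries only a preview and a length.
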

How it composes

Espada's reducers sit at the runtime/orchestration layer. They compose with everything else you're already doing to keep context cost down — they don't replace it.

Provider prompt caching

Anthropic and OpenAI prompt caching cuts the unit cost of unchanged prefixes by up to roughly 90%. Espada attacks volume instead. The two compose: a shorter, shaped tail leaves more of the cacheable-prefix budget intact, and transcript virtualization rewrites only older results, at indices already outside the typical cache window.
Contextual retrieval

Anthropic Contextual Retrieval operates over an external knowledge base. Espada's reducers operate over the agent's own past tool outputs within a single session — a different object. A deployment can use both at the same time.
Prompt-token compression

LLMLingua and Selective-Context are provider- and task-agnostic by design, so they can't exploit the structure of a workflow_run envelope or tell "0 failed" from "failed". Espada's structural shapers can. Token-layer compression can still be applied on top of shaped output.

All eight reducers can be disabled independently for ablation. The research paper covers the per-fixture studies, the full loss model, and the reproducibility protocol: corpus, baselines (LLMLingua-2, Selective-Context, naive truncation, single-LLM summary), provider tokenizers, and task-quality benchmarks.

Read the full design

Ready when you are.

One binary. One install. One hour to your first signed action.

Install Espada · Read the security model
