Quantifying token waste in typical MCP architectures
A typical enterprise MCP deployment exposes dozens to hundreds of tools. Each tool carries a description, parameter schema, and usage instructions. These tool definitions are sent to the LLM on every request as part of the context window. A deployment with 80 tools can consume 15,000 to 25,000 tokens in tool definitions alone, before the user has asked a question.
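To make this overhead concrete, the sketch below estimates how many tokens a tool catalog occupies on each request. It is a rough illustration only: the four-characters-per-token ratio is a rule of thumb rather than a real tokenizer, and the tool names, descriptions, and schema are hypothetical.

```python
import json

CHARS_PER_TOKEN = 4  # assumption: rough heuristic, not a real tokenizer


def estimate_tokens(text: str) -> int:
    """Approximate a token count from character length."""
    return max(1, len(text) // CHARS_PER_TOKEN)


def tool_definition_overhead(tools: list[dict]) -> int:
    """Estimated tokens spent sending every tool definition on one request."""
    return sum(estimate_tokens(json.dumps(tool)) for tool in tools)


def make_tool(index: int) -> dict:
    """Build a hypothetical tool definition with a description and schema."""
    return {
        "name": f"tool_{index}",
        "description": "Illustrative description of what this tool does. " * 10,
        "inputSchema": {
            "type": "object",
            "properties": {
                "catalog": {"type": "string", "description": "Catalog name"},
                "schema": {"type": "string", "description": "Schema name"},
                "table": {"type": "string", "description": "Table name"},
            },
            "required": ["table"],
        },
    }


catalog = [make_tool(i) for i in range(80)]
per_request = tool_definition_overhead(catalog)
print(f"{len(catalog)} tools: about {per_request} tokens per request")
```

Running the same estimate against a real catalog, with the model's actual tokenizer in place of the heuristic, gives the baseline that the later mechanisms reduce.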
Beyond tool definitions, agents make redundant metadata fetches. When an agent queries a table, it receives raw results. To understand the results, it calls the catalog for column descriptions. Then ownership. Then quality scores. Then deprecation status. Each call consumes tokens for the request, the response, and the agent's reasoning about what to do next. A single "describe this table" workflow can consume 3,000 to 5,000 tokens across multiple round trips.
Session-level waste compounds the problem. If an agent queries the same table twice in a conversation, it may re-fetch the same metadata. If it queries a related table, it may re-fetch overlapping context. Without session awareness, every interaction starts cold.
Where the tokens go
- 15,000 to 25,000 tokens: consumed by tool definitions alone before the user asks anything (deployment with 80 tools)
- 3,000 to 5,000 tokens: per describe-table workflow in a naive multi-call architecture (four-plus round trips)
- 40 to 60 percent: per-session token reduction with cross-enrichment, filtering, and dedup (vs. naive MCP deployment)
- 380,000 tokens: cumulative savings per 20-exchange session from persona filtering alone (25K → 6K per request)
Cross-enrichment consolidation
The first efficiency mechanism is cross-enrichment: enriching every tool response with context from complementary services. When an agent describes a table through Trino, the response includes DataHub metadata automatically. One enriched response replaces four or more separate tool calls.
The token savings are direct. Four tool calls averaging 800 tokens each (request, response, and reasoning) consume 3,200 tokens. One enriched response consumes 1,200 tokens, a saving of 2,000 tokens (about 62 percent) per describe flow. The savings scale linearly with the number of tables and datasets an agent interacts with in a session.
Cross-enrichment also eliminates the agent reasoning overhead between calls. Without enrichment, the agent must decide what additional context to fetch, formulate the request, interpret the response, and decide if more context is needed. With enrichment, the agent receives complete context in a single response and proceeds directly to answering the question.
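A minimal sketch of the idea follows, assuming two stubbed backends. The functions `fetch_schema` and `fetch_catalog_metadata` are hypothetical stand-ins for a query engine and a metadata catalog; they are not Trino or DataHub client calls, and the field names are illustrative rather than Plexara's actual response format.

```python
# Hypothetical stand-in for a query engine's schema lookup.
def fetch_schema(table: str) -> list[dict]:
    return [
        {"name": "order_id", "type": "bigint"},
        {"name": "amount", "type": "decimal(12,2)"},
    ]


# Hypothetical stand-in for a metadata catalog lookup.
def fetch_catalog_metadata(table: str) -> dict:
    return {
        "owner": "data-platform",
        "quality_score": 0.97,
        "deprecated": False,
        "column_descriptions": {
            "order_id": "Unique order key",
            "amount": "Order total in USD",
        },
    }


def describe_table_enriched(table: str) -> dict:
    """Merge schema and catalog context into a single tool response."""
    meta = fetch_catalog_metadata(table)
    descriptions = meta["column_descriptions"]
    columns = [
        {**col, "description": descriptions.get(col["name"], "")}
        for col in fetch_schema(table)
    ]
    return {
        "table": table,
        "columns": columns,
        "owner": meta["owner"],
        "quality_score": meta["quality_score"],
        "deprecated": meta["deprecated"],
    }


response = describe_table_enriched("sales.orders")

# Token arithmetic using the figures quoted in this section.
NAIVE_TOKENS = 4 * 800   # four separate calls
ENRICHED_TOKENS = 1_200  # one enriched response
print(f"saved per describe flow: {NAIVE_TOKENS - ENRICHED_TOKENS} tokens")
```

The merge happens on the server side, so the agent never has to decide which follow-up calls to make.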
Naive vs. enriched token budget
Naive multi-tool MCP (per-session budget: 65,000 tokens)
- Tool definitions: 25,000 tokens (80 tools, every request)
- Round-trip chatter: 16,000 tokens (4× describe/ownership/quality)
- Redundant re-fetches: 9,000 tokens (no session awareness)
- Work tokens: 15,000 tokens (actual reasoning + answer)
Plexara enriched MCP (per-session budget: 26,000 tokens)
- Tool definitions: 6,000 tokens (filtered to analyst persona)
- Enriched single calls: 5,000 tokens (one response, full context)
- Redundant re-fetches: 0 tokens (repeats suppressed by dedup)
- Work tokens: 15,000 tokens (same answer, same model)
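The two budgets above can be checked with a few lines of arithmetic. The dictionary keys are shorthand labels chosen for this sketch; the token figures are the ones listed in the comparison.

```python
naive = {
    "tool_definitions": 25_000,
    "round_trip_chatter": 16_000,
    "redundant_refetches": 9_000,
    "work_tokens": 15_000,
}
enriched = {
    "tool_definitions": 6_000,
    "enriched_single_calls": 5_000,
    "redundant_refetches": 0,
    "work_tokens": 15_000,
}

naive_total = sum(naive.values())
enriched_total = sum(enriched.values())
reduction = 1 - enriched_total / naive_total

print(f"naive:    {naive_total:,} tokens")
print(f"enriched: {enriched_total:,} tokens")
print(f"reduction: {reduction:.0%}")
```

The work tokens are identical in both budgets, so the entire reduction comes from overhead rather than from the answer itself.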
Tool visibility filtering
The second mechanism is tool visibility filtering. Persona-based access control determines which tools an agent can see, not just which tools it can call. An analyst persona might see 15 query and catalog tools. The same deployment might expose 60 tools total, but the analyst agent never receives the other 45 tool definitions.
This reduces the tool definition overhead from 25,000 tokens to 6,000 tokens on every request, a saving of 19,000 tokens per request. Over a session with 20 exchanges, the cumulative savings reach 380,000 tokens (19,000 × 20). Because LLM APIs typically bill input tokens on every request, this translates directly to reduced cost per session.
Visibility filtering also improves agent accuracy. An agent with 15 relevant tools makes better tool selection decisions than one parsing 60 tool descriptions. Fewer options mean less reasoning overhead and fewer incorrect tool selections that waste tokens on failed or irrelevant calls.
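As an illustration, persona filtering can be sketched as a lookup applied before the tool list is sent to the model. The `Tool` class, the category names, and the persona mapping are assumptions made for this example, not Plexara's actual configuration; the tool counts and token figures are the ones quoted above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str
    category: str


# Hypothetical mapping from persona to the tool categories it may see.
PERSONA_CATEGORIES = {
    "analyst": {"query", "catalog"},
    "admin": {"query", "catalog", "storage", "admin"},
}


def visible_tools(tools: list[Tool], persona: str) -> list[Tool]:
    """Return only the tools whose definitions this persona should receive."""
    allowed = PERSONA_CATEGORIES.get(persona, set())
    return [tool for tool in tools if tool.category in allowed]


def cumulative_savings(full: int, filtered: int, exchanges: int) -> int:
    """Tokens saved over a session when definitions shrink on every request."""
    return (full - filtered) * exchanges


# 60 tools in total, 15 of which are query or catalog tools.
all_tools = (
    [Tool(f"query_{i}", "query") for i in range(8)]
    + [Tool(f"catalog_{i}", "catalog") for i in range(7)]
    + [Tool(f"storage_{i}", "storage") for i in range(25)]
    + [Tool(f"admin_{i}", "admin") for i in range(20)]
)

analyst_tools = visible_tools(all_tools, "analyst")
saved = cumulative_savings(full=25_000, filtered=6_000, exchanges=20)
print(f"analyst sees {len(analyst_tools)} of {len(all_tools)} tools")
print(f"saved over the session: {saved:,} tokens")
```

An unknown persona receives no tools in this sketch, which is a deliberately conservative default.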
Three mechanisms
- Cross-enrichment (at the response): Every tool response is enriched with context from complementary services. Four calls collapse to one. 3,200 → 1,200 tokens per describe flow.
- Visibility filtering (at the schema): Persona-based access control determines which tools an agent can see, not just call. Unused tool descriptions never enter the context window. 25,000 → 6,000 tokens per request.
- Session dedup (across the turn): Metadata provided earlier in a conversation is not re-sent on subsequent calls against the same entity. Enrichment stays active; duplication is suppressed. Compounds with session length.
Session-aware deduplication
The third mechanism is session-aware deduplication. The platform tracks which metadata context has been provided within a conversation. If an agent described a table earlier in the session, subsequent queries against that table do not re-send the same metadata. The enrichment is still active, but duplicated context is suppressed.
Deduplication is particularly effective in exploratory sessions where an agent queries multiple tables in the same schema or follows lineage across related datasets. Overlapping metadata (shared owners, common tags, related glossary terms) is provided once and referenced subsequently.
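One way to picture this is a per-session tracker that fingerprints each metadata field it has already sent. The sketch below is an assumption about how such tracking could work, not a description of the platform's internals; the class name, the response shape, and the `already_provided` field are hypothetical. It deduplicates per entity only and does not model metadata shared across entities.

```python
import hashlib
import json


class SessionDeduplicator:
    """Tracks which metadata has already been sent in one conversation."""

    def __init__(self) -> None:
        self._seen: set[str] = set()

    @staticmethod
    def _fingerprint(entity: str, key: str, value) -> str:
        payload = json.dumps([entity, key, value], sort_keys=True, default=str)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def filter(self, entity: str, metadata: dict) -> dict:
        """Return new or changed fields, plus the names of suppressed ones."""
        fresh = {}
        suppressed = []
        for key, value in metadata.items():
            fingerprint = self._fingerprint(entity, key, value)
            if fingerprint in self._seen:
                suppressed.append(key)
            else:
                self._seen.add(fingerprint)
                fresh[key] = value
        return {
            "entity": entity,
            "metadata": fresh,
            "already_provided": sorted(suppressed),
        }


session = SessionDeduplicator()
meta = {"owner": "data-platform", "tags": ["pii", "finance"]}

first = session.filter("sales.orders", meta)    # full metadata is sent
second = session.filter("sales.orders", meta)   # repeats are suppressed
changed = session.filter("sales.orders", {"owner": "finance-eng"})
```

Because the fingerprint includes the value, a field that changes mid-session is sent again rather than being silently dropped.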
The combination of all three mechanisms reduces per-session token consumption by 40 to 60 percent compared to a naive multi-tool MCP deployment. The budget comparison above, 65,000 tokens versus 26,000 tokens, sits at the upper end of that range at 60 percent. For organizations running thousands of agent sessions per day, the reduction in LLM API costs scales with session volume.
Why fewer, richer tools outperform many narrow ones
The MCP ecosystem trend is toward tool proliferation: one tool per API endpoint, resulting in MCP servers with 50 to 200 narrow tools. Each tool does one thing. The agent must orchestrate multiple tools to accomplish any useful task.
Plexara takes the opposite approach: fewer tools that return richer responses. A single describe-table tool returns schema, business context, quality signals, ownership, lineage, and deprecation warnings. The agent receives everything it needs in one call and can proceed to answering the question.
This architectural choice has a compounding effect on token efficiency. Fewer tools mean smaller tool definition overhead. Richer responses mean fewer round trips. Fewer round trips mean less agent reasoning overhead. The result is sessions that accomplish more with fewer tokens.
