The context gap in AI data access

The accuracy problem, quantified

Give an AI agent access to a database and it will generate SQL. Give it access to the same database with business definitions, column descriptions, ownership records, glossary terms, and data quality signals, and it will generate accurate SQL. The difference is not marginal. Peer-reviewed research on the BIRD benchmark shows that semantic context improves text-to-SQL accuracy by up to 20 percentage points over schema-only approaches.

This finding is consistent across production deployments. LinkedIn built its internal Trino chatbot on a foundation of knowledge graphs and semantic context, not just schema introspection. Every serious enterprise deployment of AI-powered data access has converged on the same architectural conclusion: schema is necessary but not sufficient.

Gartner projects that 60 percent of agentic analytics projects relying solely on MCP without a semantic layer will fail by 2028. The prediction targets a specific failure mode: agents that can reach data but cannot understand what the data means.

Evidence

Up to +20pp: accuracy improvement from semantic context over schema-only text-to-SQL (BIRD benchmark, peer-reviewed)

60%: agentic analytics projects on MCP without a semantic layer projected to fail (Gartner, by 2028)

12,751: question/SQL pairs with explicit knowledge evidence in the BIRD benchmark (95 databases)

Schema-only access is the dominant failure mode researchers are now measuring directly.

Why schema is not enough

A database schema tells an agent that a column named "amt" exists in a table called "txn_detail" with type DECIMAL(12,2). It does not tell the agent that "amt" represents gross transaction amount in cents, that it includes tax, that it should be divided by 100 for display, that it was renamed from "total_amount" in Q3 2025, or that the finance team considers it deprecated in favor of "net_amt" for revenue reporting.

Without this context, the agent will use "amt" in queries where "net_amt" is correct. It will report values stored in cents as if they were dollars. It will join on tables that have been superseded. The queries will execute successfully and return plausible results. The results will be wrong.
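A minimal sketch of that failure mode, using Python's built-in sqlite3 module. The two rows and their values are hypothetical, and so is the assumption that "net_amt" is also stored in cents; only the table and column names come from the example above. Both queries run without error, which is the point.

```python
import sqlite3

# Hypothetical data: amt is gross (tax included) and stored in cents;
# net_amt is assumed here to be stored in cents as well.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE txn_detail (amt INTEGER, net_amt INTEGER)")
conn.executemany(
    "INSERT INTO txn_detail (amt, net_amt) VALUES (?, ?)",
    [(10800, 10000), (5400, 5000)],
)

# Schema-only guess: sums the deprecated gross column and reads cents as dollars.
naive_revenue = conn.execute("SELECT SUM(amt) FROM txn_detail").fetchone()[0]

# Context-aware query: uses net_amt and converts cents to dollars.
correct_revenue = conn.execute(
    "SELECT SUM(net_amt) / 100.0 FROM txn_detail"
).fetchone()[0]

print(naive_revenue)    # 16200
print(correct_revenue)  # 150.0
```

Neither query raises an error, and nothing in the schema alone distinguishes the two results.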

More sophisticated prompts do not compensate for missing business context. Retrieval-augmented generation helps only when the relevant documentation exists, is current, and is retrievable. In most enterprises, it is none of these things. The business context lives in the heads of experienced team members who learned it through years of working with the data.

Schema vs. enriched response

Schema-only response

{
  "table": "txn_detail",
  "columns": [
    { "name": "amt", "type": "DECIMAL(12,2)" },
    { "name": "created_at", "type": "TIMESTAMP" }
  ]
}

Is amt dollars or cents?
Does it include tax? Is it gross or net?
Is it the field you should be using at all?

Plexara enriched response

{
  "table": "txn_detail",
  "description": "Line-item transactions",
  "owner": "finance-platform",
  "columns": [{
    "name": "amt",
    "type": "DECIMAL(12,2)",
    "semantic": "Gross transaction (cents)",
    "includes_tax": true,
    "deprecated_for": "net_amt",
    "glossary": "gross_margin",
    "quality": 0.94
  }]
}

Correct unit, correct aggregation, correct field.
Deprecation caught before the query runs.
One response, no additional catalog calls.

The same tool call, with and without protocol-level enrichment. The agent sees the column; only the enriched version tells it what to do with it.
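One way an agent-side helper might act on the enriched fields is sketched below. The function name resolve_column and its return shape are hypothetical and not part of any Plexara API; the metadata is an abridged copy of the enriched response above.

```python
import json

# Abridged copy of the enriched response shown above.
ENRICHED = json.loads("""
{
  "table": "txn_detail",
  "columns": [{
    "name": "amt",
    "type": "DECIMAL(12,2)",
    "semantic": "Gross transaction (cents)",
    "includes_tax": true,
    "deprecated_for": "net_amt",
    "quality": 0.94
  }]
}
""")


def resolve_column(table_meta: dict, requested: str) -> tuple[str, list[str]]:
    """Hypothetical helper: return the column to query plus any warnings."""
    warnings: list[str] = []
    for col in table_meta.get("columns", []):
        if col["name"] != requested:
            continue
        replacement = col.get("deprecated_for")
        if replacement:
            # The deprecation is caught before any SQL is written.
            warnings.append(f"{requested} is deprecated; using {replacement}")
            return replacement, warnings
        return requested, warnings
    # No metadata for this column: fall back to the requested name, but say so.
    warnings.append(f"no metadata for {requested}")
    return requested, warnings


column, notes = resolve_column(ENRICHED, "amt")
print(column, notes)
```

With the schema-only response there is no "deprecated_for" field to check, so the same helper would have nothing to act on.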

What semantic enrichment adds

Semantic enrichment delivers business context at the protocol level, before the agent makes decisions about how to query. When an agent describes a table through a query engine, the enriched response includes not just the schema but also column descriptions, data owners, classification tags, glossary term mappings, deprecation warnings, data quality scores, upstream and downstream lineage, and active incidents.

This transforms the agent interaction from "here are the columns and their types" to "here are the columns, what they mean, who owns them, how they relate to business concepts, whether they are reliable, and what you should watch out for." The agent receives in a single response what would otherwise require four or more separate tool calls to a catalog, a quality monitoring system, an ownership registry, and a lineage tracker.

The consolidation matters architecturally, not just for convenience. Every additional tool call consumes tokens, increases latency, introduces failure modes, and requires the agent to synthesize information from disparate sources. A single enriched response avoids those extra round trips and gives the agent complete context every time.
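The consolidation can be pictured as a server-side merge, sketched here under stated assumptions. The function enrich_describe, the source names, and the in-memory lookups are all hypothetical stand-ins for real services; the sketch only shows the shape of folding several lookups into one response and tolerating a system that has no entry.

```python
from typing import Any, Callable

# Each lookup takes a table name and returns that system's context for it.
Lookup = Callable[[str], dict[str, Any]]


def enrich_describe(
    table: str, schema: dict[str, Any], sources: dict[str, Lookup]
) -> dict[str, Any]:
    """Hypothetical merge: fold context from each source into one response."""
    response = dict(schema)
    for name, lookup in sources.items():
        try:
            response[name] = lookup(table)
        except LookupError:
            # A missing entry in one system should not fail the whole response.
            response[name] = None
    return response


# Stand-in data; real lookups would call the respective services.
catalog = {"txn_detail": {"description": "Line-item transactions"}}
owners = {"txn_detail": {"owner": "finance-platform"}}
quality = {"txn_detail": {"score": 0.94}}
lineage: dict[str, dict[str, Any]] = {}  # no lineage recorded for this table

sources: dict[str, Lookup] = {
    "catalog": lambda t: catalog[t],
    "ownership": lambda t: owners[t],
    "quality": lambda t: quality[t],
    "lineage": lambda t: lineage[t],
}

schema = {"table": "txn_detail", "columns": [{"name": "amt", "type": "DECIMAL(12,2)"}]}
enriched = enrich_describe("txn_detail", schema, sources)
print(enriched)
```

The agent issues one describe call and receives one dictionary; the four lookups and the handling of the missing lineage entry stay on the platform side.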

The thesis: Schema tells an agent what columns exist. Context tells it what they mean. Only the second one produces queries you can trust.

The cross-enrichment approach

Bidirectional cross-enrichment is the mechanism that makes semantic enrichment practical at scale. When an agent queries through a SQL engine, metadata from the catalog is included in the response. When an agent searches the catalog, availability information from the query engine is included in the results. When an agent inspects an object in storage, catalog metadata about that object is included. Each service enriches responses from the others.

This approach is fundamentally different from expecting agents to call multiple services and correlate the results. Correlation requires the agent to understand the relationship between services, to know which catalog entity corresponds to which query engine table, and to handle the case where metadata exists in one system but not another. Cross-enrichment handles all of this at the platform level, transparently.
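The reverse direction, catalog results annotated with query-engine availability, might look like the following sketch. The function annotate_search_hits, the "queryable" flag, and the identifiers in the sample hits are hypothetical; the point is that the entity-to-table correlation happens once on the platform instead of in every agent.

```python
def annotate_search_hits(hits: list[dict], engine_tables: set[str]) -> list[dict]:
    """Hypothetical: flag each catalog hit with whether the engine exposes it."""
    return [
        # Hits with no mapped table, or a table the engine does not know,
        # are marked not queryable instead of being dropped.
        {**hit, "queryable": hit.get("table") in engine_tables}
        for hit in hits
    ]


# Illustrative catalog hits; the identifiers are made up for this sketch.
hits = [
    {"entity": "finance/txn_detail", "table": "txn_detail"},
    {"entity": "finance/legacy_totals", "table": "legacy_totals"},
    {"entity": "finance/glossary/gross_margin"},  # a glossary term, not a table
]
engine_tables = {"txn_detail"}

annotated = annotate_search_hits(hits, engine_tables)
print([(h["entity"], h["queryable"]) for h in annotated])
```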

The enrichment is session-aware. Once business context for a dataset has been provided in a conversation, it is not repeated in subsequent responses within the same session. This deduplication further reduces token consumption without sacrificing context quality.
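Session-aware deduplication could be sketched as a small per-session cache. The class SessionEnricher and the "context" and "context_ref" keys are hypothetical names chosen for illustration; the source does not describe how the deduplication is implemented.

```python
class SessionEnricher:
    """Hypothetical per-session wrapper that sends each table's context once."""

    def __init__(self, context_by_table: dict[str, dict]):
        self._context = context_by_table
        self._sent: set[str] = set()  # tables whose context was already sent

    def respond(self, table: str, payload: dict) -> dict:
        response = dict(payload)
        if table in self._sent:
            # Already delivered in this session: send a short marker instead.
            response["context_ref"] = "sent earlier in this session"
        elif table in self._context:
            response["context"] = self._context[table]
            self._sent.add(table)
        return response


session = SessionEnricher({"txn_detail": {"owner": "finance-platform"}})
first = session.respond("txn_detail", {"rows": 2})
second = session.respond("txn_detail", {"rows": 5})
print(first)
print(second)
```

A new session starts with an empty cache, so the first response for each dataset in that session carries the full context again.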

Cross-enrichment: Plexara returns one enriched response from 1 call, combining:

Query engine: schema + types
Catalog: descriptions + ownership
Quality: scores + incidents
Lineage: upstream + downstream
Glossary: business terms linked

One enriched response replaces what would otherwise be four or more cross-service correlations. Deduplicated per session.

Industry validation

The BIRD benchmark, designed specifically to evaluate text-to-SQL with external knowledge, demonstrates that models with access to business definitions, value descriptions, and domain knowledge significantly outperform models working from schema alone. With 12,751 question-SQL pairs across 95 databases and explicit knowledge evidence, it is one of the most rigorous tests of context-aware SQL generation.

Production architectures at scale reinforce this finding. Enterprise deployments that achieve reliable AI data access invariably include a semantic or context layer between the agent and the data. The specific implementation varies, but the architectural pattern is consistent: raw schema access produces unreliable results; contextually enriched access produces trustworthy ones.

The market is recognizing this pattern. Catalog vendors, semantic layer vendors, and query engine vendors are all adding context delivery mechanisms. The debate has moved from whether context is necessary to how it should be delivered. The most efficient approach is enrichment at the protocol level: every response enriched automatically, no additional tool calls required, no agent-side correlation logic needed. A standalone catalog or semantic layer bolted on as a point solution leaves the agent to do that correlation itself.

The cost of getting it wrong

An inaccurate query that executes successfully is more dangerous than one that fails. A failed query surfaces an error. An inaccurate query surfaces plausible-looking numbers that may inform decisions before anyone realizes they are wrong. In regulated industries, inaccurate data access can trigger compliance violations. In financial services, it can produce incorrect risk calculations. In healthcare, it can lead to flawed analyses of patient outcomes.

The context gap compounds over time. Agents that generate inaccurate queries erode trust in AI-powered data access. Teams that lose trust in the results revert to manual processes, negating the investment in AI infrastructure. Gartner predicts that 50 percent of AI agent deployment failures by 2030 will stem from insufficient governance enforcement at runtime, a failure mode that is closely linked to the absence of business context at the point of query execution.

Closing the context gap is the prerequisite for trustworthy AI data access, and no model upgrade or prompt engineering substitutes for it. What closes it is better context, delivered automatically, at the protocol level, on every response: an agent that already knows what "amt" means before it writes the query.


