How AI usage becomes data documentation

Why catalogs stay empty

Every enterprise has a data catalog. Most of them are empty. The tables are registered, the schemas are synced, and the descriptions are blank. Column-level documentation hovers near zero percent coverage. Glossary terms exist but are not linked to the datasets they define. Ownership records are stale.

This is not a tooling failure. The catalog products are capable. The failure is structural: documentation is treated as a separate activity from data usage. A data engineer who discovers that a column represents gross margin in cents has no mechanism to record that discovery at the point of discovery. Instead, the engineer is expected to open the catalog, find the correct entity, edit the column description, and save the change. This takes minutes for a single column. Multiply by thousands of undocumented columns across hundreds of datasets, and the backlog becomes permanent.

The result is a catalog that is technically deployed and functionally vacant. AI agents that rely on this catalog for context receive empty descriptions and missing ownership records. The agents then generate queries from schema alone, reproducing the very accuracy problems that the catalog, as a context layer, was supposed to solve.

Starting state

~0%: typical column-level documentation coverage in enterprise catalogs (industry observation)

0→18: documented columns in txn_detail before and after 30 days of normal usage with knowledge application (illustrative catalog entity)

Capture sources feeding the review pipeline: users, agents, gap detection

The failure is not tooling. It is structural: documentation has always been an activity separate from usage.

Three capture sources

Plexara captures knowledge from three sources, each addressing a different gap in catalog documentation.

The first source is user-provided knowledge. When a user corrects an agent during a conversation ("that column is gross margin, not revenue" or "always filter on status=active for this table"), the correction is captured as a structured insight with the specific entity, the type of correction, and the suggested catalog change. The user does not need to open a separate tool or navigate to the catalog. The knowledge is captured in the flow of work.
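As a rough illustration of what such a structured insight might look like, here is a minimal sketch. The `Insight` class, its field names, and `capture_user_correction` are hypothetical and are not taken from Plexara's actual schema; only the recorded elements (entity, type of correction, suggested change, high confidence for user input) come from the description above.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Insight:
    """A captured piece of knowledge awaiting review (hypothetical schema)."""
    entity: str            # catalog entity the correction refers to
    source: str            # where the insight came from, here "user"
    correction_type: str   # e.g. "column_description" or "usage_note"
    suggested_change: str  # text proposed for the catalog
    confidence: str        # user corrections are treated as high confidence
    status: str = "captured"  # nothing is applied until a human approves it
    captured_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


def capture_user_correction(entity: str, correction_type: str, text: str) -> Insight:
    """Turn a correction made in conversation into a structured insight."""
    return Insight(
        entity=entity,
        source="user",
        correction_type=correction_type,
        suggested_change=text,
        confidence="high",
    )


insight = capture_user_correction(
    "txn_detail.amt", "column_description", "Gross margin, not revenue."
)
print(insight.entity, insight.confidence, insight.status)
```

The point of the sketch is that the user only supplies the sentence; the entity, correction type, and review status are attached by the platform.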

The second source is agent-discovered insights. When an agent queries data and observes patterns ("column amt appears to be in cents based on value ranges" or "this table has not been updated since January"), these observations are captured as lower-confidence insights flagged for human review. The agent does the analytical work; humans validate the conclusion.
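One way an agent might arrive at the "appears to be in cents" observation is sketched below. The heuristic (all values are whole numbers and the median is at least 100) and the function name `observe_possible_cents` are assumptions made for illustration, not the detection logic the platform actually uses. What matters is the shape of the result: a medium-confidence insight that is flagged for human review rather than applied.

```python
from statistics import median
from typing import Optional, Sequence


def observe_possible_cents(column: str, values: Sequence[float]) -> Optional[dict]:
    """Return a review-flagged insight if the values look like integer cents.

    Hypothetical heuristic: every value is a whole number and the median
    is at least 100. Real detection logic would need to be more careful.
    """
    if not values:
        return None
    all_whole = all(float(v).is_integer() for v in values)
    if all_whole and median(values) >= 100:
        return {
            "entity": column,
            "source": "agent",
            "confidence": "medium",   # lower than a direct user correction
            "needs_review": True,     # a human validates the conclusion
            "observation": f"Column {column} appears to be in cents "
                           "based on value ranges.",
        }
    return None


print(observe_possible_cents("txn_detail.amt", [1999, 250000, 4500]))
print(observe_possible_cents("txn_detail.rate", [19.99, 2.5]))
```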

The third source is enrichment gaps. When the semantic enrichment middleware processes a tool response and finds missing metadata ("this table has no description," "12 columns have no documentation," "no owner is assigned"), the gap is recorded automatically. Over time, the gap log becomes a prioritized list of documentation debt, ranked by how frequently each undocumented entity is accessed.
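The ranking of documentation debt by access frequency can be sketched with a small in-memory log. `GapLog`, `record`, and `prioritized` are hypothetical names; the sketch assumes the middleware reports, for each tool response, which entity was touched and which metadata fields were missing.

```python
from collections import Counter
from typing import Dict, Iterable, List, Set


class GapLog:
    """In-memory stand-in for an enrichment gap log (hypothetical)."""

    def __init__(self) -> None:
        self._gaps: Dict[str, Set[str]] = {}   # entity -> missing metadata kinds
        self._access: Counter = Counter()      # entity -> times accessed

    def record(self, entity: str, missing: Iterable[str]) -> None:
        """Count one access and remember any metadata found missing."""
        self._access[entity] += 1
        missing = list(missing)
        if missing:
            self._gaps.setdefault(entity, set()).update(missing)

    def prioritized(self) -> List[str]:
        """Entities with gaps, most frequently accessed first."""
        return sorted(self._gaps, key=lambda e: (-self._access[e], e))


log = GapLog()
for _ in range(3):
    log.record("txn_detail", ["description", "owner"])
log.record("cust_dim", ["owner"])
for _ in range(5):
    log.record("orders", [])   # fully documented, so never listed as debt

print(log.prioritized())
```

Fully documented entities never enter the list no matter how often they are queried, so the output contains only real debt, ordered by how much that debt is getting in the way.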

Three capture sources at a glance

src 01, user-provided: a correction in the flow of work. Example: "That column is gross margin, not revenue." Confidence: high.

src 02, agent-discovered: observed patterns flagged for review. Example: "Column amt appears to be in cents based on value ranges." Confidence: medium.

src 03, enrichment gaps: missing metadata logged automatically. Example: "Table has no description. 12 columns undocumented." Confidence: detected.

Each source addresses a different gap. Together they cover the full surface where organizational knowledge leaks away.

The review pipeline

Nothing writes to the catalog without human approval. This is a deliberate design decision. AI-generated documentation that bypasses human review accumulates errors that are difficult to detect and expensive to correct. The review pipeline ensures quality while minimizing the effort required from reviewers.

Captured insights flow through four stages: capture, review, synthesis, and application. During capture, the insight is recorded with its source, confidence level, and suggested changes. During review, an administrator evaluates the insight for accuracy and relevance. During synthesis, related insights about the same entity are combined into cohesive documentation. During application, the approved changes are written to the catalog as a tracked changeset.
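The four stages can be read as a simple state machine with one hard gate: an insight cannot leave review without approval. The sketch below is illustrative only; `Stage` and `advance` are hypothetical names, and the real pipeline presumably carries far more state per insight.

```python
from enum import Enum


class Stage(Enum):
    CAPTURE = 1
    REVIEW = 2
    SYNTHESIS = 3
    APPLICATION = 4


def advance(stage: Stage, approved: bool = False) -> Stage:
    """Move an insight to the next stage, enforcing the human approval gate."""
    if stage is Stage.APPLICATION:
        raise ValueError("insight has already been applied")
    if stage is Stage.REVIEW and not approved:
        # Nothing proceeds toward the catalog without an administrator's approval.
        raise PermissionError("review stage requires human approval")
    return Stage(stage.value + 1)


stage = Stage.CAPTURE
stage = advance(stage)                  # capture -> review
stage = advance(stage, approved=True)   # review -> synthesis
stage = advance(stage)                  # synthesis -> application
print(stage.name)
```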

Every changeset includes full provenance: which insights contributed, who approved the changes, and what the previous values were. Administrators can roll back any changeset to restore the prior state. This audit trail satisfies the same governance requirements that apply at execution time, and it gives teams the confidence to approve changes knowing they are reversible.
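A minimal sketch of a tracked changeset follows, assuming an in-memory stand-in for the catalog. `Changeset`, `Catalog`, and their methods are hypothetical names and do not represent the DataHub API or Plexara's implementation; the sketch only shows why storing the previous value alongside the contributing insights and the approver makes rollback straightforward.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Changeset:
    entity: str
    field_name: str
    new_value: str
    previous_value: Optional[str]  # kept so the change can be reversed
    insight_ids: List[str]         # which insights contributed
    approved_by: str               # who approved the change


class Catalog:
    """In-memory stand-in for a metadata catalog (hypothetical)."""

    def __init__(self) -> None:
        self.values: Dict[Tuple[str, str], str] = {}
        self.history: List[Changeset] = []

    def apply(self, entity: str, field_name: str, new_value: str,
              insight_ids: List[str], approved_by: str) -> Changeset:
        key = (entity, field_name)
        changeset = Changeset(entity, field_name, new_value,
                              self.values.get(key), list(insight_ids), approved_by)
        self.values[key] = new_value
        self.history.append(changeset)
        return changeset

    def rollback(self, changeset: Changeset) -> None:
        """Restore the value that was in place before the changeset."""
        key = (changeset.entity, changeset.field_name)
        if changeset.previous_value is None:
            self.values.pop(key, None)
        else:
            self.values[key] = changeset.previous_value


catalog = Catalog()
cs = catalog.apply("txn_detail.amt", "description",
                   "Gross margin in cents", ["ins-1", "ins-2"], "admin@example.com")
catalog.rollback(cs)
print(catalog.values)  # the description is gone again after rollback
```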

The review pipeline, stage by stage

Stage 01, Capture: source, confidence, suggested change
Stage 02, Review: administrator evaluates accuracy
Stage 03, Synthesize: related insights merged coherently
Stage 04, Apply: tracked changeset written to catalog

Nothing writes to the catalog without human approval. Every changeset can be rolled back and carries full provenance.

The maturity flywheel

The knowledge application system creates a self-reinforcing cycle. Usage generates insights. Insights improve documentation. Better documentation improves agent accuracy. Better accuracy drives more usage. Each rotation of the cycle makes the data platform more valuable.

This flywheel effect distinguishes knowledge application from one-time documentation initiatives. A documentation sprint produces a snapshot that begins decaying immediately. The knowledge application system produces documentation that improves continuously because it is connected to the ongoing activity of people using data.

The rate of improvement is proportional to usage. Datasets that are queried frequently accumulate documentation faster. Columns that are discussed in conversations get descriptions sooner. Business terms that are explained to agents get linked to glossary entries. The documentation naturally prioritizes the data that matters most to the organization.

Maturity flywheel: knowledge application at the center of a cycle running from usage to insights to documentation to accuracy, and back to usage.

Datasets queried frequently accumulate documentation faster. The rate of improvement is proportional to usage. Each rotation makes the platform more valuable.

Concrete before and after

Consider a table called "txn_detail" with 24 columns. Before knowledge application: the table has no description, no column documentation, no linked glossary terms, and an owner record pointing to an employee who left the company two years ago. An agent querying this table relies entirely on column names and types to generate SQL.

After 30 days of normal usage with knowledge application enabled: the table has a description generated from three user corrections synthesized during review. Eighteen of twenty-four columns have descriptions, sourced from user corrections (8), agent discoveries (6), and synthesis of multiple insights (4). Three glossary terms (gross margin, net revenue, transaction status) are linked. The owner record has been updated based on a user correction identifying the current steward. Six enrichment gap flags remain for columns that were not discussed during the period.
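The figures in this example are internally consistent, which a few lines of arithmetic confirm. The variable names are illustrative; the numbers are the ones given above.

```python
TOTAL_COLUMNS = 24

# Documented columns by origin, as stated in the example.
documented_by_source = {
    "user corrections": 8,
    "agent discoveries": 6,
    "synthesis of multiple insights": 4,
}

documented = sum(documented_by_source.values())
remaining_gaps = TOTAL_COLUMNS - documented
coverage = documented / TOTAL_COLUMNS

print(f"{documented}/{TOTAL_COLUMNS} columns documented "
      f"({coverage:.0%}), {remaining_gaps} gap flags remain")
# prints: 18/24 columns documented (75%), 6 gap flags remain
```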

No one ran a documentation initiative. No one assigned a documentation task. The catalog improved because people used the data, and the platform captured what they already knew.

Concrete outcome: the txn_detail catalog entry at day 0 and at day 30

description: none → synthesized from 3 user corrections
documented columns: 0 / 24 → 18 / 24 (8 user, 6 agent, 4 synth)
glossary links: 0 → 3 (gross margin, net revenue, status)
owner: former employee (2024) → current steward (updated)

A single table, txn_detail, before and after 30 days of normal usage with knowledge application enabled. No one ran a documentation initiative.

Why this capability is rare

Building a knowledge application system requires control of both the execution layer and the metadata layer. That reach exists only on a platform where the agent can see the whole stack. A catalog vendor can build the review pipeline but cannot capture insights from query sessions, because the catalog does not execute queries. A query engine vendor can capture observations from query patterns but cannot write documentation back to the catalog, because the engine does not manage metadata.

The knowledge application system works because the same platform that executes queries also manages the metadata catalog and provides the review workflow. Corrections captured during a Trino query session are written to DataHub through an admin pipeline hosted on the same platform. The architecture is vertically integrated by design.

This vertical integration also explains why the feature cannot be replicated by stitching together point solutions. A separate capture tool, a separate review tool, and a separate catalog require three integration projects and continuous synchronization. The complexity is prohibitive, which is why most enterprises have a catalog and a query engine but not a knowledge feedback loop connecting them.


