Why catalogs stay empty
Nearly every enterprise has a data catalog, and most of them are effectively empty. The tables are registered and the schemas are synced, but the descriptions are blank. Column-level documentation coverage hovers near zero percent. Glossary terms exist but are not linked to the datasets they define. Ownership records are stale.
This is not a tooling failure. The catalog products are capable. The failure is structural: documentation is treated as a separate activity from data usage. A data engineer who discovers that a column represents gross margin in cents has no mechanism to record that discovery at the point of discovery. Instead, the engineer is expected to open the catalog, find the correct entity, edit the column description, and save the change. This takes minutes for a single column. Multiply by thousands of undocumented columns across hundreds of datasets, and the backlog becomes permanent.
The result is a catalog that is technically deployed and functionally vacant. AI agents that rely on this catalog for context receive empty descriptions and missing ownership records. The agents generate queries based on schema alone, reproducing the accuracy problems that the catalog was supposed to solve.
- Starting state: column-level documentation coverage in enterprise catalogs is typically near zero (industry observation)
- txn_detail, an illustrative catalog entity: documented columns go from 0 / 24 to 18 / 24 after 30 days of normal usage with knowledge application
- Three capture sources feed the review pipeline: users, agents, gap detection
Three capture sources
Plexara captures knowledge from three sources, each addressing a different gap in catalog documentation.
The first source is user-provided knowledge. When a user corrects an agent during a conversation ("that column is gross margin, not revenue" or "always filter on status=active for this table"), the correction is captured as a structured insight with the specific entity, the type of correction, and the suggested catalog change. The user does not need to open a separate tool or navigate to the catalog. The knowledge is captured in the flow of work.
The second source is agent-discovered insights. When an agent queries data and observes patterns ("column amt appears to be in cents based on value ranges" or "this table has not been updated since January"), these observations are captured as lower-confidence insights flagged for human review. The agent does the analytical work; humans validate the conclusion.
The third source is enrichment gaps. When the semantic enrichment middleware processes a tool response and finds missing metadata ("this table has no description," "12 columns have no documentation," "no owner is assigned"), the gap is recorded automatically. Over time, the gap log becomes a prioritized list of documentation debt, ranked by how frequently each undocumented entity is accessed.
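The gap log described above can be pictured as a counter keyed by entity. The following is a minimal sketch, not Plexara's implementation: the class names `EnrichmentGap` and `GapLog`, the gap type labels, and the method names are all hypothetical and chosen only for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class EnrichmentGap:
    """A missing piece of metadata found while enriching a tool response."""
    entity: str    # e.g. "txn_detail" (hypothetical naming scheme)
    gap_type: str  # e.g. "no_description", "no_owner" (hypothetical labels)


class GapLog:
    """Records gaps and counts how often each gapped entity is accessed."""

    def __init__(self) -> None:
        self._gaps = set()
        self._access_counts = Counter()

    def record_access(self, entity: str, gaps: list) -> None:
        """Call once per tool response that touched `entity`."""
        self._access_counts[entity] += 1
        for gap_type in gaps:
            self._gaps.add(EnrichmentGap(entity, gap_type))

    def prioritized(self) -> list:
        """Open gaps, most frequently accessed entities first."""
        ranked = [(gap, self._access_counts[gap.entity]) for gap in self._gaps]
        # Sort by descending access count, then by name for a stable order.
        return sorted(ranked, key=lambda p: (-p[1], p[0].entity, p[0].gap_type))


log = GapLog()
log.record_access("txn_detail", ["no_description", "no_owner"])
log.record_access("txn_detail", [])  # accessed again, no new gaps found
log.record_access("dim_customer", ["no_description"])
top = log.prioritized()
```

Because the ranking key is the access count, the gaps on `txn_detail` (accessed twice) sort ahead of the gap on `dim_customer` (accessed once), which is the sense in which documentation debt is prioritized by usage.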
Figure: Three capture sources
- User-provided: a correction in the flow of work. Example: "That column is gross margin, not revenue." Confidence: high
- Agent-discovered: observed patterns flagged for review. Example: "Column amt appears to be in cents based on value ranges." Confidence: medium
- Enrichment gaps: missing metadata logged automatically. Example: "Table has no description. 12 columns undocumented." Confidence: detected
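One way to picture a captured insight is as a small record that carries its source, the entity it refers to, the suggested change, and a confidence level. The sketch below is illustrative only: the type names, field names, and the `capture` helper are assumptions, and the default confidence per source simply mirrors the three levels listed above.

```python
from dataclasses import dataclass
from enum import Enum


class Source(Enum):
    USER = "user-provided"
    AGENT = "agent-discovered"
    GAP = "enrichment-gap"


class Confidence(Enum):
    HIGH = "high"
    MEDIUM = "medium"
    DETECTED = "detected"


# Assumed default confidence for each capture source.
DEFAULT_CONFIDENCE = {
    Source.USER: Confidence.HIGH,
    Source.AGENT: Confidence.MEDIUM,
    Source.GAP: Confidence.DETECTED,
}


@dataclass(frozen=True)
class Insight:
    source: Source
    entity: str            # table or column the insight refers to
    kind: str              # hypothetical label, e.g. "column_description"
    suggested_change: str  # proposed catalog text; empty for a pure gap
    confidence: Confidence
    status: str = "pending_review"  # every insight starts out unreviewed


def capture(source: Source, entity: str, kind: str, suggested_change: str = "") -> Insight:
    """Build an insight with the default confidence for its source."""
    return Insight(source, entity, kind, suggested_change, DEFAULT_CONFIDENCE[source])


correction = capture(Source.USER, "txn_detail.amt", "column_description",
                     "Gross margin, not revenue.")
observation = capture(Source.AGENT, "txn_detail.amt", "column_unit",
                      "Values appear to be in cents.")
gap = capture(Source.GAP, "txn_detail", "no_description")
```

Every record starts in a pending state regardless of source, so even a high-confidence user correction still waits for review.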
The review pipeline
Nothing writes to the catalog without human approval. This is a deliberate design decision. AI-generated documentation that bypasses human review accumulates errors that are difficult to detect and expensive to correct. The review pipeline ensures quality while minimizing the effort required from reviewers.
Captured insights flow through four stages: capture, review, synthesis, and application. During capture, the insight is recorded with its source, confidence level, and suggested changes. During review, an administrator evaluates the insight for accuracy and relevance. During synthesis, related insights about the same entity are combined into cohesive documentation. During application, the approved changes are written to the catalog as a tracked changeset.
Every changeset includes full provenance: which insights contributed, who approved the changes, and what the previous values were. Administrators can roll back any changeset to restore the prior state. This audit trail satisfies governance requirements while giving teams the confidence to approve changes, knowing they are reversible.
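Rollback is possible because each changeset remembers what it overwrote. The sketch below illustrates that idea under loud assumptions: the catalog is a plain in-memory dictionary standing in for a real metadata store, and `Changeset`, `apply`, and `rollback` are hypothetical names, not a documented API.

```python
from dataclasses import dataclass, field


@dataclass
class Changeset:
    insight_ids: list   # which insights contributed
    approved_by: str    # who approved the changes
    new_values: dict    # catalog field -> approved value
    previous_values: dict = field(default_factory=dict)  # filled in on apply


def apply(catalog: dict, changeset: Changeset) -> None:
    """Write approved values, remembering what each field held before."""
    for key, value in changeset.new_values.items():
        # None records that the field did not exist before this changeset.
        changeset.previous_values[key] = catalog.get(key)
        catalog[key] = value


def rollback(catalog: dict, changeset: Changeset) -> None:
    """Restore every field touched by the changeset to its prior state."""
    for key, old in changeset.previous_values.items():
        if old is None:
            catalog.pop(key, None)
        else:
            catalog[key] = old


catalog = {"txn_detail.owner": "former.employee"}
cs = Changeset(
    insight_ids=["ins-1", "ins-2"],
    approved_by="admin",
    new_values={
        "txn_detail.owner": "current.steward",
        "txn_detail.description": "Transaction line items.",
    },
)
apply(catalog, cs)
after_apply = dict(catalog)
rollback(catalog, cs)
```

After `rollback`, the catalog again holds only the original owner record, and the description that the changeset introduced is gone.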
Figure: The review pipeline
- Stage 01, Capture: source, confidence, suggested change
- Stage 02, Review: administrator evaluates accuracy
- Stage 03, Synthesize: related insights merged coherently
- Stage 04, Apply: tracked changeset written to catalog
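The middle two stages can be sketched as a filter followed by a group-and-merge step. This is a simplified illustration, not the product's logic: insights are plain dictionaries, the `approve` callable stands in for an administrator's decision, and the merge is a naive de-duplicate-and-join that a real synthesis step would replace with something far more careful.

```python
from collections import defaultdict
from typing import Callable


def review(insights: list, approve: Callable[[dict], bool]) -> list:
    """Stage 02: keep only the insights a human reviewer approves."""
    return [insight for insight in insights if approve(insight)]


def synthesize(approved: list) -> dict:
    """Stage 03: group approved insights by entity and merge their text."""
    by_entity = defaultdict(list)
    for insight in approved:
        by_entity[insight["entity"]].append(insight["suggested_change"])
    # Naive merge: drop exact duplicates while preserving order, then join.
    return {
        entity: " ".join(dict.fromkeys(texts))
        for entity, texts in by_entity.items()
    }


captured = [
    {"entity": "txn_detail.amt", "source": "user", "suggested_change": "Gross margin."},
    {"entity": "txn_detail.amt", "source": "agent", "suggested_change": "Stored in cents."},
    {"entity": "txn_detail.amt", "source": "user", "suggested_change": "Gross margin."},
    {"entity": "txn_detail.status", "source": "agent", "suggested_change": "Always NULL."},
]


def approve(insight: dict) -> bool:
    """Stand-in for the reviewer: rejects the one inaccurate observation."""
    return insight["suggested_change"] != "Always NULL."


proposed = synthesize(review(captured, approve))
```

Here the two distinct approved insights about `txn_detail.amt` collapse into a single proposed description, and the rejected observation about `txn_detail.status` never reaches synthesis.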
The maturity flywheel
The knowledge application system creates a self-reinforcing cycle. Usage generates insights. Insights improve documentation. Better documentation improves agent accuracy. Better accuracy drives more usage. Each rotation of the cycle makes the data platform more valuable.
This flywheel effect distinguishes knowledge application from one-time documentation initiatives. A documentation sprint produces a snapshot that begins decaying immediately. The knowledge application system produces documentation that improves continuously because it is connected to the ongoing activity of people using data.
The rate of improvement is proportional to usage. Datasets that are queried frequently accumulate documentation faster. Columns that are discussed in conversations get descriptions sooner. Business terms that are explained to agents get linked to glossary entries. The documentation naturally prioritizes the data that matters most to the organization.
Figure: Maturity flywheel. Knowledge application links Usage, Insights, Documentation, and Accuracy in a repeating cycle.
Concrete before and after
Consider a table called "txn_detail" with 24 columns. Before knowledge application: the table has no description, no column documentation, no linked glossary terms, and an owner record pointing to an employee who left the company two years ago. An agent querying this table relies entirely on column names and types to generate SQL.
After 30 days of normal usage with knowledge application enabled: the table has a description synthesized during review from three user corrections. Eighteen of the 24 columns have descriptions, sourced from user corrections (8), agent discoveries (6), and synthesis of multiple insights (4). Three glossary terms (gross margin, net revenue, transaction status) are linked. The owner record has been updated based on a user correction identifying the current steward. Six enrichment gap flags remain for columns that were not discussed during the period.
No one ran a documentation initiative. No one assigned a documentation task. The catalog improved because people used the data, and the platform captured what they already knew.
Figure: Concrete outcome for txn_detail, Day 0 catalog entry vs. Day 30 (same table)
- description. Day 0: none. Day 30: synthesized from 3 user corrections
- documented columns. Day 0: 0 / 24. Day 30: 18 / 24 (8 user, 6 agent, 4 synth)
- glossary links. Day 0: 0. Day 30: 3 (gross margin, net revenue, status)
- owner. Day 0: former employee (2024). Day 30: current steward (updated)
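The column counts in the txn_detail example can be checked with a few lines of arithmetic. The figures below are taken directly from the example; only the variable names are introduced here.

```python
# Documented columns in txn_detail after 30 days, by origin of the description.
by_source = {"user corrections": 8, "agent discoveries": 6, "synthesis": 4}
total_columns = 24

documented = sum(by_source.values())         # columns with a description
coverage = documented / total_columns        # fraction of columns documented
remaining_gaps = total_columns - documented  # columns still flagged as gaps

print(f"{documented} / {total_columns} documented ({coverage:.0%}), "
      f"{remaining_gaps} gap flags remain")
```

The three origins sum to the eighteen documented columns, which is 75% coverage, and the six undocumented columns match the six enrichment gap flags that remain.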
Why this capability is rare
Building a knowledge application system requires control of both the execution layer and the metadata layer. A catalog vendor can build the review pipeline but cannot capture insights from query sessions because the catalog does not execute queries. A query engine vendor can capture observations from query patterns but cannot write documentation back to the catalog because the engine does not manage metadata.
The knowledge application system works because the same platform that executes queries also manages the metadata catalog and provides the review workflow. Corrections captured during a Trino query session are written to DataHub through an admin pipeline hosted on the same platform. The architecture is vertically integrated by design.
This vertical integration also explains why the feature cannot be replicated by stitching together point solutions. A separate capture tool, a separate review tool, and a separate catalog require three integration projects and continuous synchronization. The complexity is prohibitive, which is why most enterprises have a catalog and a query engine but not a knowledge feedback loop connecting them.
