What you will take away from this lesson
In 202 - Your first day with Plexara, we walked through what platform_info loads and the three-step discover-query-enrich workflow the operating manual prescribes. This lesson zooms in on the first step: discovery, which on a Plexara MCP means the DataHub toolkit.
The pattern is simple. Before writing any query, the agent asks the catalog what exists and which query templates are appropriate. That discovery step is what makes the answer grounded and the path performant. Skipping it is the most common cause of wrong-turn sessions.
Learning Objectives
- 01 Explain what DataHub is inside Plexara and why the discover step of the workflow runs through the DataHub toolkit.
- 02 Identify the main DataHub tools (search, get_entity, get_schema, get_queries, get_lineage, get_glossary_term) and what each one returns.
- 03 Read a datahub_search result and name the metadata fields attached to it (descriptions, owners, tags, glossary terms, lineage hints, deprecation status).
- 04 Recognize the "query first" anti-pattern and use the short domain-framing prompt that eliminates it for exploratory work.
- 05 Describe what a glossary term is and how it bridges business vocabulary to technical columns so the agent answers questions in your organization's language.
Where we are in the curriculum
If any term in this lesson feels unfamiliar, the 100 series is one click back. The 200 series assumes that mental model.
100 Series: the foundation
- 101 What is a Large Language Model? Brilliant at language, blind about your data. Tokens, hallucination, the grounding problem.
- 102 Tokens and your budget. Subscription-plan economics, session limits, and Plexara enrichment dedup.
- 103 Context, compression, and memory. The keep / compress / clear playbook and how memory carries across sessions.
- 104 Frontier models, specialized models, and why enterprise AI uses both. Three knowledge sources (training, web search, tools). MCP as the exposure protocol.
- 105 What is an AI agent? The think/call-tool/observe loop. Professor's knowledge, child's literalism.
- 110 Is MCP just an API wrapper? MCP as an application layer. Spectrum from thin wrapper to full application server.
What DataHub is inside Plexara
DataHub is an open-source metadata platform that Plexara uses as the catalog backend. Its job is to hold everything the agent needs to know about your data before it touches any data: what datasets exist, what columns they contain, who owns them, what they mean in business language, which dashboards consume them, which ones have been deprecated, and which curated queries are appropriate against them.
Plexara exposes DataHub to the agent through a toolkit of related tools. Most of them are read operations a power user cares about on an ordinary day; a smaller number are write operations that administrators and knowledge curators use to keep the catalog current. Both groups live under the same toolkit, but persona-based governance decides which ones a given caller can see.
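To make persona-based visibility concrete, here is a minimal sketch in Python. It is not Plexara's implementation: the tool names come from this lesson, while the `Persona` class, the `can_curate` flag, and the persona names are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass

# Tool names are the ones this lesson lists. How Plexara actually stores
# or evaluates persona rules is not described here; this is an assumption.
READ_TOOLS = frozenset({
    "datahub_search", "datahub_get_entity", "datahub_get_schema",
    "datahub_get_queries", "datahub_get_lineage", "datahub_get_glossary_term",
})
# The operations the lesson attributes to administrators and curators.
CURATION_TOOLS = frozenset({
    "datahub_create", "datahub_update", "datahub_delete", "datahub_browse",
})


@dataclass(frozen=True)
class Persona:
    # Hypothetical persona record: a name plus one coarse permission flag.
    name: str
    can_curate: bool


def visible_tools(persona: Persona) -> list[str]:
    """Return the sorted tool names a caller with this persona would see."""
    tools = set(READ_TOOLS)
    if persona.can_curate:
        tools |= CURATION_TOOLS
    return sorted(tools)


analyst = Persona(name="analyst", can_curate=False)
curator = Persona(name="knowledge_curator", can_curate=True)
print(visible_tools(analyst))
```

The point of the sketch is that both groups of tools sit in one toolkit and the filter is applied per caller, so an analyst never sees a tool they cannot use.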
The DataHub toolkit (read-side tools most power users care about)
datahub_search
When to use: The broad entry point. Start here on any exploratory question.
Returns: Ranked catalog hits for a natural-language query, each with descriptions, owners, tags, glossary terms, and flags for deprecation or data quality.
datahub_get_entity
When to use: Once search has pointed at a likely dataset, pull the full record to confirm.
Returns: The full canonical record for a specific dataset, dashboard, data product, or other entity, identified by URN.
datahub_get_schema
When to use: Before writing any query that references specific columns by name.
Returns: Column names, types, nullability, descriptions, tags, and any glossary-term bindings for a specific dataset.
datahub_get_queries
When to use: Prefer these over free-form SQL. They are faster, tested, and usually correct.
Returns: Curated, pre-benchmarked query templates associated with a dataset. Templates are annotated with performance characteristics and the aggregation patterns they support.
datahub_get_lineage
When to use: When a number looks off and the question becomes "where does this value actually come from."
Returns: Upstream and downstream relationships for a dataset: which sources feed it, which reports and dashboards consume it.
datahub_get_glossary_term
When to use: To resolve ambiguous language. "Churn" or "active customer" rarely means the same thing across teams; the glossary says what your organization means.
Returns: The definition of a business term and the datasets, columns, or metrics bound to it.
The DataHub toolkit also includes write operations (datahub_create, datahub_update, datahub_delete, datahub_browse) that administrators and knowledge curators use. Those live next to the read-side tools above but are governed separately. See 207 - Governance: personas, access, and audit.
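The advice to prefer curated templates over free-form SQL can be sketched as a small selection step. The lesson does not show the shape of a datahub_get_queries result, so the `QueryTemplate` fields, the aggregation labels, and the SQL text below are all made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class QueryTemplate:
    # Hypothetical shape for one entry in a datahub_get_queries result.
    name: str
    sql: str
    aggregations: frozenset[str]


def pick_template(templates: list[QueryTemplate], needed: str) -> Optional[QueryTemplate]:
    """Return the first curated template that supports the needed aggregation."""
    for template in templates:
        if needed in template.aggregations:
            return template
    # No template fits: only now would free-form SQL be considered.
    return None


# Illustrative data; the column names and SQL are invented.
templates = [
    QueryTemplate(
        name="daily_sales_by_region",
        sql="SELECT region, SUM(amount) FROM daily_sales GROUP BY region",
        aggregations=frozenset({"sum_by_region"}),
    ),
]

chosen = pick_template(templates, "sum_by_region")
fallback = pick_template(templates, "median_by_store")
```

Here `chosen` is the curated template, and `fallback` is `None`, which is the signal to write a query by hand against the schema.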
What datahub_search actually returns
The first tool in the toolkit is also the one used most often. A datahub_search call with a natural-language query returns ranked catalog hits, each fully enriched with the metadata the agent needs to decide what to do next. The enrichment is automatic; the agent does not make separate calls to fetch owners, tags, or glossary terms.
What a datahub_search hit actually contains
URN
Example: urn:li:dataset:(urn:li:dataPlatform:postgres,warehouse.public.daily_sales,PROD)
What it tells you: The canonical pointer to a dataset. Pass this to get_entity, get_schema, or get_queries to drill in.
Description
Example: Daily sales aggregated by store and region. Refreshed nightly at 02:00 UTC.
What it tells you: What the dataset is, in business language. Often the single most useful field for the agent.
Owners
Example: Team: Retail Analytics; Steward: Sarah Chen
What it tells you: Who to ask if the description does not resolve an ambiguity. Surfaces in the answer when a user challenges a number.
Tags and glossary terms
Example: tags: finance, retail, q3-critical. glossary: GrossRevenue, FiscalQuarter.
What it tells you: Machine-searchable and human-readable. The glossary terms especially matter: they bridge business vocabulary to technical columns.
Lineage hints
Example: Upstream: transactions, stores. Downstream: exec_dashboard, cfo_weekly.
What it tells you: Fast answer to "where does this number come from" and "what breaks if we change this."
Quality and deprecation signals
Example: quality_score: 0.94. deprecated: false. last_validated: 2026-04-18.
What it tells you: Whether the dataset is trustworthy for the current question. Deprecated datasets are de-prioritized automatically.
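The URN in the example hit packs three pieces of information: the platform, the dataset name, and the environment. The sketch below pulls them apart with a regular expression. The helper name `parse_dataset_urn` and the `DatasetRef` class are hypothetical, and the pattern assumes the simple dataset URN layout shown in this lesson, with no commas inside the dataset name.

```python
import re
from dataclasses import dataclass

# Matches the dataset URN layout used in this lesson's example hit.
_DATASET_URN = re.compile(
    r"^urn:li:dataset:\(urn:li:dataPlatform:(?P<platform>[^,]+),"
    r"(?P<name>[^,]+),(?P<env>[^)]+)\)$"
)


@dataclass(frozen=True)
class DatasetRef:
    platform: str
    name: str
    env: str


def parse_dataset_urn(urn: str) -> DatasetRef:
    """Split a dataset URN into platform, dataset name, and environment."""
    match = _DATASET_URN.match(urn)
    if match is None:
        raise ValueError(f"not a dataset URN: {urn!r}")
    return DatasetRef(**match.groupdict())


ref = parse_dataset_urn(
    "urn:li:dataset:(urn:li:dataPlatform:postgres,warehouse.public.daily_sales,PROD)"
)
print(ref.platform, ref.name, ref.env)
# prints: postgres warehouse.public.daily_sales PROD
```

In practice the agent passes the URN through unchanged; reading it this way is mainly useful for a human checking that a hit points at the expected platform and environment.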
Why discovery precedes querying
The operating manual that platform_info loads tells the agent to call datahub_search (and usually datahub_get_queries) before writing any query against the data. This is not a style preference. It is how the agent avoids inventing schemas, misreading column names, or picking the wrong definition of a metric.
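One way to picture what discovery buys you is a check of the columns a draft query mentions against the columns the catalog reports. This is a sketch under assumptions: the column set stands in for a datahub_get_schema result, and both the column names and the `unknown_columns` helper are invented for illustration.

```python
def unknown_columns(requested: list[str], schema_columns: set[str]) -> list[str]:
    """Return the requested column names that the catalog schema does not contain."""
    return [col for col in requested if col not in schema_columns]


# Hypothetical column set, standing in for a datahub_get_schema result.
schema_columns = {"sale_date", "store_id", "region", "gross_revenue"}

# A draft query that guessed "revenue" instead of the real column name.
missing = unknown_columns(["region", "revenue"], schema_columns)
if missing:
    print(f"Do not run the query; unknown columns: {missing}")
# prints: Do not run the query; unknown columns: ['revenue']
```

An agent that skips discovery has no `schema_columns` to compare against, which is how invented or misread column names end up in a query.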
The two-minute domain warm-up
Even with perfect catalog metadata, exploratory sessions benefit from a short domain warm-up at the start. Giving the agent a moment to describe the data estate in its own words activates the relevant slice of its training-time world knowledge and surfaces any obvious gaps before you ask a question that depends on them.
The enrichment Plexara attaches to every tool result is thorough, but it covers only what has been documented. Obvious-to-humans context (a retailer sells physical goods through stores, a bank charges fees across customer accounts, a SaaS vendor tracks monthly recurring revenue) is not in the catalog unless somebody wrote it down. A short warm-up closes that gap.
Glossary terms are how business language gets resolved
For questions phrased in business vocabulary, the most useful part of a catalog record is usually its glossary term bindings. A question such as "show me active customers by region" reaches a technical schema that almost certainly does not have a column called "active customers." The glossary term is how the agent resolves the mismatch. It is also how your organization's specific definition of "active" gets applied, rather than the one the model inferred from general training data.
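The resolution step can be sketched as a lookup from business phrases to glossary terms and their column bindings. Everything in the data below is made up for illustration, including the 90-day definition and the column name; in Plexara the real definition and bindings would come from datahub_get_glossary_term.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GlossaryTerm:
    # Hypothetical shape for a glossary term record.
    name: str
    definition: str
    bound_columns: tuple[str, ...]


# Invented glossary content, keyed by the phrase a user might type.
GLOSSARY = {
    "active customer": GlossaryTerm(
        name="ActiveCustomer",
        definition="Customer with at least one purchase in the last 90 days.",
        bound_columns=("customers.last_purchase_date",),
    ),
}


def resolve_terms(question: str) -> list[GlossaryTerm]:
    """Return glossary terms whose phrase appears in the question."""
    lowered = question.lower()
    return [term for phrase, term in GLOSSARY.items() if phrase in lowered]


matches = resolve_terms("Show me active customers by region")
for term in matches:
    print(term.name, term.bound_columns)
```

The naive substring match is a simplification. What matters is the direction of the mapping: the organization's definition and the bound columns drive the query, not the model's general sense of the word "active."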
When the enrichment is enough on its own
Not every session needs a domain warm-up. Narrow, specific questions with clear entity references ("daily revenue by region for 2025") carry enough context in the question itself that datahub_search plus the enrichment fills in the rest. The warm-up pays off most for new users, cross-domain questions, and anything where you are not already sure which dataset you want.
Where this leads
With the right dataset identified and a curated query template in hand, the agent is ready to run the query. The next lesson covers how Plexara reaches data through Trino and why picking the right query shape matters so much for performance.
Key terms
Six terms cover the vocabulary you will see across DataHub-related tool results, catalog documentation, and every conversation about the discover step.
- DataHub: The open-source metadata platform Plexara uses as the catalog backend. Holds datasets, dashboards, data products, glossary terms, lineage, and quality signals. Plexara's DataHub toolkit is the read and write surface the agent and administrators use.
- URN (Uniform Resource Name): The canonical identifier for a catalog entity (dataset, dashboard, data product, glossary term). Passed to get_entity, get_schema, and get_queries to drill in after a search hit points at a specific record.
- Glossary term: A named business concept with a formal definition and bindings to the datasets and columns that implement it. The mechanism by which business language (active customer, gross revenue) maps to technical columns.
- Lineage: Upstream and downstream relationships for a dataset. Answers where a value comes from (upstream) and which reports and dashboards would break if the dataset changed (downstream).
- Data product: A curated, named bundle of related datasets, dashboards, and queries that the organization treats as a consumable unit. A product has owners and a purpose; underlying datasets may be ingredients.
- Curated query template: A pre-benchmarked, annotated query stored against a dataset in DataHub. Retrieved via datahub_get_queries. Preferred over free-form SQL because it is tested, fast, and already encodes the right aggregation patterns.
