We’ve convinced ourselves that comprehensive cataloging is the answer to data complexity. Catalog everything—every table, stream, video frame, GPS coordinate, document—and we’d have control. But consider: you deploy an enterprise catalog, wire up thousands of assets, and six months later analysts are still asking in Slack for “the right revenue table.”
The more we expand what counts as “data,” the more we might be building cathedrals when we need workshops.
The landscape has shifted. Data no longer means rows and columns. It means chat streams and video footage. It means LiDAR scans for highway sightlines or bridge clearances, and documents specifying electrical installations sealed behind tunnel ceilings. Drone imagery feeds algorithms that monitor bridge integrity. Each carries meaning, generates insight, demands governance.
Yet the traditional response—catalog it all—may be solving yesterday’s problem with tomorrow’s scale.
The Weight of Comprehensiveness
An enterprise data catalog promises one authoritative registry of all metadata across every data type. One interface. One source of truth. Complete lineage.
But operational reality tells a different story: the catalog becomes heavy. Maintaining metadata for structured tables is one challenge; maintaining it for video streams, chat logs, and infrastructure documents is another entirely. Who defines what metadata matters for a video clip? How do you catalog a LiDAR scan that only becomes meaningful when calculating dangerous cross-fall in a specific curve? How do you keep metadata fresh for data that exists only momentarily?
Modern AI-driven auto-tagging can be useful for pattern recognition, but it is insufficient on its own: it often compounds the dilution, creating thousands of tags that lack business context and erode user trust.
The economics become strained: the cost of cataloging each asset must be weighed against the value that asset delivers, and not every asset justifies the investment.
The catalog becomes slow—not just in performance, but in adapting to new data types and reflecting how people actually work.
The catalog becomes diluted. When everything carries equal weight, finding the genuinely critical becomes harder. Users work around the catalog rather than with it.
In trying to make everything findable, we’ve created a landscape where what matters most gets buried.
Different Contexts, Different Needs
We’re conflating distinct contexts under one umbrella. A data engineer building pipelines has different needs than a governance officer maintaining Records of Processing Activities. A data scientist exploring features has different needs than a business analyst consuming pre-built reports or an ML engineer looking for AI-ready datasets.
Consider three fundamentally different functions:
The Platform Catalog is for builders, optimized for speed and operational metadata. It answers “What’s the schema?” and “How fresh is this?” Its value comes from being lightweight, fast, and tightly integrated with transformation tools.
The Governance Registry is for risk and compliance, optimized for coverage and traceability. It tracks what data exists, where it came from, who owns it, how it’s processed. It connects to data valuation systems and feeds regulatory requirements. Its value comes from comprehensiveness and auditability.
The Data Marketplace is for consumers, optimized for consumption-ready products. It showcases finished data products and data marts—analytics-ready datasets, AI-ready features, pre-built aggregations. Its value comes from preparing and packaging solutions for immediate use, not cataloging raw assets requiring interpretation.
These operate in different rhythms with different quality bars. Forcing them into one system produces coherence on architecture diagrams and friction in practice.
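A minimal sketch makes the contrast concrete. The records below are hypothetical and deliberately simplified—every field name is an assumption, not a reference to any real product—but notice how little the three descriptions of the same underlying dataset share:

```python
from dataclasses import dataclass, field

@dataclass
class PlatformCatalogEntry:
    # For builders: fast, operational, machine-checked.
    table: str                     # e.g. "analytics.revenue_daily"
    schema: dict                   # column name -> type
    freshness_sla_minutes: int     # how stale is acceptable
    last_updated: str              # ISO-8601 timestamp from the pipeline

@dataclass
class GovernanceRegistryEntry:
    # For risk and compliance: complete, auditable, slow-moving.
    asset_id: str
    owner: str
    source_systems: list
    contains_personal_data: bool
    retention_policy: str
    processing_purposes: list = field(default_factory=list)

@dataclass
class MarketplaceListing:
    # For consumers: packaged, documented, ready to use.
    product_name: str              # e.g. "Daily Revenue, Certified"
    business_definition: str       # plain-language meaning
    certified: bool
    example_query: str             # a copy-paste starting point
```

Almost no field appears in more than one record, which is the practical argument against forcing all three into a single schema.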
The Modular Alternative
The alternative is to let each function run as its own module. In practice, the Platform Catalog focuses on what’s actively in use—structured tables, streaming chat data, processed drone imagery being fed into bridge monitoring algorithms. Raw LiDAR scans waiting for a specific engineering calculation—massive in volume, uncertain in utility until a project needs them? Not its concern.
The Governance Registry focuses on compliance and risk. It tracks personal data regardless of location—customer records in databases, faces in video footage, voices in chat transcripts. It knows that electrical installation documents exist sealed behind tunnel concrete, creating liability even when physically inaccessible. But it doesn’t try to be a search engine for every table schema.
The Data Marketplace curates consumption-ready products. A processed dataset analyzing bridge structural trends from drone imagery—packaged as analytics-ready tables with clear business definitions—might appear here. LiDAR-derived datasets showing clearance heights and sightline calculations for highway planning, prepared as ready-to-use metrics, certainly would. An AI-ready feature set built from anonymized chat transcripts for sentiment analysis might qualify. Raw infrastructure documents, unprocessed scans, and tables requiring deep technical knowledge never appear.
This modularity costs you the single pane of glass but buys fit-for-purpose tools. The key is ensuring the modules share intelligence through standards like OpenLineage or DCAT, turning what might look like fragmentation into a deliberate distributed architecture. That allows coordination without centralization: each catalog stays fit for purpose while contributing to organizational coherence.
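To make “shared intelligence” less abstract, here is a hand-built OpenLineage run event. The event shape follows the published spec, but every namespace, job name, and URL below is invented for illustration; a real pipeline would normally emit this through the official client library rather than raw JSON:

```python
import json
import uuid
from datetime import datetime, timezone

# A minimal OpenLineage run event. The structure follows the public spec;
# the namespaces, job, and dataset names are made up for this example.
event = {
    "eventType": "COMPLETE",
    "eventTime": datetime.now(timezone.utc).isoformat(),
    "run": {"runId": str(uuid.uuid4())},
    "job": {"namespace": "bridge-monitoring", "name": "drone_imagery_to_trends"},
    "inputs": [{"namespace": "s3://drone-footage", "name": "bridge_417/2024"}],
    "outputs": [{"namespace": "warehouse", "name": "analytics.bridge_structural_trends"}],
    "producer": "https://example.org/pipelines/bridge-monitoring",  # placeholder URI
    # Spec version pinned for illustration; check openlineage.io for the current one.
    "schemaURL": "https://openlineage.io/spec/1-0-5/OpenLineage.json#/definitions/RunEvent",
}

print(json.dumps(event, indent=2))
```

Each module can consume the same event for its own ends: the Platform Catalog updates freshness, the Governance Registry extends its lineage graph, and the Marketplace refreshes the “last built” stamp on the affected data product.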
Modularity also forces honest questions: what doesn’t get cataloged? Not everything needs to be findable through a central catalog; some data is perfectly discoverable in the operational systems where it lives.
And critically, we must decouple visibility from control. A nuclear reactor is heavily governed, but it’s not discoverable on a public library shelf. Similarly, documents detailing electrical installations behind tunnel concrete require rigorous version control and access policies at the source, regardless of whether they ever appear in a searchable directory.
Cataloging has costs. Those costs are worth paying for data meeting specific criteria: regulatory risk, reusability across teams, business criticality, or high frequency of access. But “data that matters” is smaller than “all data that exists.”
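Those criteria can even be made mechanical, at least as a first-pass filter. The sketch below is deliberately crude—the weights and threshold are placeholders any real organization would tune or replace—but it shows the shape of an honest triage:

```python
# A first-pass filter for "is this worth cataloging centrally?"
# Criteria mirror the ones above; weights and threshold are illustrative.

def worth_cataloging(asset) -> bool:
    score = 0
    if asset.get("regulatory_risk"):           # GDPR exposure, liability, audits
        score += 3
    if asset.get("reused_by_teams", 0) > 1:    # crosses team boundaries
        score += 2
    if asset.get("business_critical"):         # feeds revenue or safety decisions
        score += 2
    if asset.get("monthly_accesses", 0) > 50:  # actually used, not just stored
        score += 1
    return score >= 3

# A raw LiDAR scan awaiting a future project scores low; leave it
# discoverable in the operational system where it lives.
print(worth_cataloging({"regulatory_risk": False, "monthly_accesses": 2}))  # False
# Chat transcripts containing personal data clear the bar on risk alone.
print(worth_cataloging({"regulatory_risk": True}))  # True
```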
The Strategic Choice
The choice should be intentional. If your organization is driven by regulatory compliance in a heavily regulated industry, the comprehensive Governance Registry might justify its weight. If your organization runs on a modern data platform where engineers build and transform, the Platform Catalog might suffice. If you’re democratizing data access across business units who need answers rather than raw materials, the consumption-ready Data Marketplace might deliver more value.
Most organizations need elements of all three. The real choice is about boundaries and interfaces. Where does one catalog’s responsibility end and another’s begin?
Perhaps the deepest question is whether cataloging itself remains the right metaphor. Catalogs emerged from static collections—library books, museum artifacts, product inventories arranged on shelves. Data today is fundamentally different. It streams, transforms, exists in multiple representations, disappears.
We might need something less like a catalog and more like a navigation system.
Cataloging assumes pre-indexing what exists; navigation assumes finding paths through what’s changing.
That might combine on-demand search for unstructured content, automated lineage that traces backwards from the point of discovery, and recommendation engines driven by actual usage patterns.
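As one illustration of that last idea, usage-pattern recommendations need nothing more exotic than a query log. The log format and dataset names below are invented; the point is that the “index” emerges from behavior rather than being curated up front:

```python
from collections import Counter
from itertools import combinations

# Toy query log: which datasets each query touched together.
query_log = [
    {"user": "ana", "datasets": ["revenue_daily", "fx_rates", "orders"]},
    {"user": "ben", "datasets": ["revenue_daily", "fx_rates"]},
    {"user": "ana", "datasets": ["orders", "customers"]},
]

# Count how often each pair of datasets is used in the same query.
co_access = Counter()
for q in query_log:
    for a, b in combinations(sorted(set(q["datasets"])), 2):
        co_access[(a, b)] += 1

def recommend(dataset: str, top_n: int = 3):
    """Datasets most often queried alongside `dataset`."""
    scores = Counter()
    for (a, b), n in co_access.items():
        if a == dataset:
            scores[b] += n
        elif b == dataset:
            scores[a] += n
    return scores.most_common(top_n)

print(recommend("revenue_daily"))  # [('fx_rates', 2), ('orders', 1)]
```

No one pre-tagged “fx_rates” as related to “revenue_daily”; the path emerged from how people actually navigate.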
The data landscape has expanded far beyond tables; our approach to making it navigable should expand too—not just in scope, but in conception. Sometimes that means better catalogs, sometimes different catalogs for different needs, and sometimes recognizing that not everything needs cataloging.
The paradox resolves when we stop trying to make everything findable and start making the right things findable in the right ways for the right people.
What’s your experience with data catalogs? Have you faced the paradox of comprehensive cataloging creating more noise than signal? I’d love to hear your thoughts in the comments.

Thanks Bjorn, it’s an important issue, and one the scientific method has grappled with for centuries, i.e., “what constitutes evidence relevant to our analysis?”
In my view we need a few key pieces of data and information to be confident in our analysis and conclusions.
First, we need a “business ecosystem map” made up of the systems relevant to the issue at hand. These systems form the “indicative” map, i.e., those systems contributing data & information to our analysis. These can be tangible or intangible systems depending on the objective.
Second, we need to understand the kinds of evidence that may be relevant to our question, how to handle them, and the biases commonly associated with them (see link below).
e.g., https://plato.stanford.edu/search/searcher.py?query=Evidence+
Collectively, the above could be seen as your “navigation system”. It may still be useful to visualise the data & information available in each system as a catalogue, as long as those sources are evaluated for relevance following something like the Stanford guide to “evidence” linked above.
An interesting guideline question could be “how does our selection or omission of potentially relevant systems and/or the data & information associated with them affect the variance of our conclusions?”