Top 2 Reasons Why Data Catalogs Fail

Here are the top two reasons why, despite all good intentions, many data catalogs end up failing

Mark Grover
Towards Data Science



You notice that data producers (data engineers, product engineers, analytics engineers) and data consumers (data analysts, data scientists, business users) don’t know what data exists, how to use it, and what is trustworthy. You see a ton of questions on Slack and a real hit on the data team’s productivity.

You get a big, fancy data catalog with lots of bells and whistles, but months go by and no one uses it. You show it at your all-hands meeting, you send email blasts, you even reprimand people when they discuss data questions in the Slack channel, pushing them to have those conversations in the data catalog instead.

But at the end of the day, the usage never sticks. Your data catalog never gets adopted.

I have seen this story play out so many times. Here are the top 2 reasons why, despite all good intentions, many data catalogs end up failing:

1. Catalog ghost town — lack of descriptions and metadata

To derive value from most data catalogs, you need to populate them with valuable information: descriptions, tags, primary keys, foreign keys, common ways to query the data, frequently asked questions, and so on. You get the idea. The hard part is getting your co-workers to enter this context.

And even if you somehow convince others to add documentation, it quickly goes out of date. Because data evolves quickly, documentation can’t just be added once; it needs constant upkeep.

When that upkeep doesn’t happen, your data catalog doesn’t just stop helping; it actively hurts. This is the single most common story of data catalog failure.

Here are the three best antidotes I have seen to this:

Automate as much as possible. Try to get as much metadata as possible through automation. There’s no reason someone needs to hand-populate foreign keys, primary keys, and common ways to query the data for every single data set. All of that information is already buried, and kept up to date, in your existing usage patterns and query logs. Extract the most common filters, join conditions, upstream dependencies, and downstream consumers. More on how to do that in a future blog post.
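As a taste of what that automation can look like, here’s a minimal sketch that mines join conditions and filter columns out of raw query text. The regexes and the in-memory log are naive assumptions for illustration; a production version would use a proper SQL parser and your warehouse’s actual query history.

```python
import re
from collections import Counter

# Naive patterns (assumptions for illustration); a real implementation
# would parse the SQL properly rather than regex-match it.
JOIN_RE = re.compile(
    r"JOIN\s+\w+(?:\s+\w+)?\s+ON\s+([\w.]+)\s*=\s*([\w.]+)", re.IGNORECASE
)
WHERE_RE = re.compile(r"WHERE\s+([\w.]+)\s*[=<>]", re.IGNORECASE)

def mine_query_log(queries):
    """Count the most common join conditions and filter columns in a log."""
    joins, filters = Counter(), Counter()
    for sql in queries:
        for left, right in JOIN_RE.findall(sql):
            joins[f"{left} = {right}"] += 1
        filters.update(WHERE_RE.findall(sql))
    return joins.most_common(5), filters.most_common(5)

if __name__ == "__main__":
    log = [  # stand-in for your real query history
        "SELECT * FROM orders o JOIN users u ON o.user_id = u.id WHERE o.status = 'paid'",
        "SELECT u.id FROM users u JOIN orders o ON o.user_id = u.id WHERE u.country = 'US'",
    ]
    top_joins, top_filters = mine_query_log(log)
    print("Likely join keys:", top_joins)    # hints at foreign-key relationships
    print("Common filters:", top_filters)    # hints at how people query the data
```

The recurring join keys are strong hints at foreign-key relationships, and the recurring filters tell you how people actually use the data, with no manual data entry required.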

Extract documentation from “the flow”. Documentation is best captured when it’s in the user’s flow. When a new data set is created, enforce that documentation gets added right then. Sometimes that’s a process change, but quite often it’s technical: I have seen quite a few successful deployments that put documentation checks into CI/CD and break the build if the appropriate documentation isn’t supplied.
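As one illustration, here’s what such a check might look like, assuming dbt-style schema.yml files with a description per model (the layout and paths are assumptions; adapt them to your stack). Wire it into CI so a missing description fails the build.

```python
import sys
from pathlib import Path

import yaml  # requires PyYAML

def undocumented_models(root: Path) -> list[str]:
    """Return 'file: model' entries for models missing a description."""
    missing = []
    for schema_file in root.rglob("schema.yml"):
        spec = yaml.safe_load(schema_file.read_text()) or {}
        for model in spec.get("models", []):
            if not model.get("description"):
                missing.append(f"{schema_file}: {model.get('name', '<unnamed>')}")
    return missing

if __name__ == "__main__":
    failures = undocumented_models(Path("models"))  # assumed project layout
    if failures:
        print("Models missing descriptions:")
        print("\n".join(failures))
        sys.exit(1)  # non-zero exit breaks the CI build
    print("All models documented.")
```

Because the check runs on every pull request, documentation debt can’t silently accumulate.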

Curate the top 20%. There’s no denying that some things require a human to curate. Take your most commonly viewed metrics, dashboards, and tables, and document them. The prerequisite is accepting that you can’t boil the ocean, and knowing which of your tables are queried most and which of your dashboards are viewed most.
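As a rough sketch of how you might build that shortlist, here’s one way to rank tables by how often they appear in a query log and cut the curation backlog at the top 20%. The regex extraction and the in-memory log are stand-ins; warehouses like Snowflake and BigQuery expose query-history views you could draw from instead.

```python
import re
from collections import Counter

# Naive table extraction; a real version would use a SQL parser.
TABLE_RE = re.compile(r"\b(?:FROM|JOIN)\s+([\w.]+)", re.IGNORECASE)

def curation_backlog(queries, top_fraction=0.2):
    """Rank tables by query count and return the top slice to document first."""
    counts = Counter(
        table.lower() for sql in queries for table in TABLE_RE.findall(sql)
    )
    ranked = counts.most_common()
    cutoff = max(1, round(len(ranked) * top_fraction))
    return ranked[:cutoff]

if __name__ == "__main__":
    log = [  # stand-in for your real query history
        "SELECT * FROM analytics.orders",
        "SELECT * FROM analytics.orders JOIN analytics.users ON orders.user_id = users.id",
        "SELECT count(*) FROM analytics.page_views",
    ]
    for table, hits in curation_backlog(log):
        print(f"{table}: {hits} queries; document this one first")
```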

2. Your catalog is too broad and not deep enough — leading to fragmentation

To understand and trust data, you often need to answer questions and take actions like the following:

  • What are the most common conversations about this data?
  • What knowledge base/wikis exist about this data?
  • Query the data to explore it further

More recently, catalogs have become quite bulky, causing fragmentation in the user’s golden path:

  • Conversations — You can have them on Slack or in your data catalog
  • Knowledge base — You can write wiki articles in Confluence or in your data catalog
  • Querying — You can query in your BI tool (Tableau, Looker, Mode, Snowflake UI) or you can query data via your data catalog.

This fragmentation is the worst possible thing. Why?

  • The users now have to figure out which tool to choose for what.
  • The data team now has to maintain both options. Take querying, for example. You now have to implement RBAC in your BI tool and in your data catalog’s querying tool. As if implementing RBAC in just one tool weren’t hard enough.

Ultimately, this fragmentation leads to enough friction, both individually and organizationally, for the data catalog to fail. Your catalog ends up doing a crappy job at too many things.

The antidote to this problem: if you have a catalog product that’s too bulky, choose your golden path thoughtfully and disable duplicate features among your products so that there’s one suggested option. For example, have one place for people to have conversations, either in your catalog or in Slack, and discourage the other option.

Summary

Data catalogs fail for two reasons:

  1. Catalog ghost town — not enough descriptions and metadata.
  2. Catalog too broad and not deep enough — fragmentation in the user’s golden path.

To avoid these problems, regardless of your tool choice: automate as much metadata as possible, reserve human curation for the most impactful data, and stay laser-focused on your user’s golden path.

To read more posts like this and stay in touch, follow me on Twitter or get a monthly newsletter with content like this by subscribing here.
