How to Make Your Data Catalog Successful

Learnings from dozens of companies on how to make your data catalog successful

Mark Grover
Towards Data Science

PHOTO BY JOSHUA SORTINO ON UNSPLASH

There are only 2 goals that matter when it comes to measuring the success of a data catalog: 1) adoption, and 2) customer satisfaction. If you nail these two, you are successful.

I’m the co-creator of the leading open-source data catalog, Amundsen, which is used by 35+ companies including Instacart, Square, Brex, Asana, and many more. In this post, I share key learnings from Lyft, other Amundsen adopters, and Stemma customers on what makes a data catalog install successful.

Some of these learnings are already incorporated in Stemma, but this article covers the ones that haven’t been captured in the product yet. They focus on how to launch the product, how to land it for great adoption, and how to measure success.

1. Prioritize a persona and its use-cases

There are many user personas and use cases for a data catalog. Successful installs prioritize which personas and use cases to focus on first. Here’s a simplified view¹ of the most common personas and use cases for a data catalog. Which persona you start with matters less than starting with a specific target group of users.

Image by author: Most common personas and use-cases for a data catalog

2. Launch in phases

In this section, I’ll dive deeper into best practices for launching your data catalog.

Step 1: Identify a small set of tables to get alpha user feedback on.

  • This set can be the most commonly used tables within the company (often referred to as “core” tables) or one domain within the company like marketing, growth or finance, etc.
  • More often, I have seen core tables being the chosen set, partly because they are the most impactful, but also because there’s often a central data team responsible for maintaining them.
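
If you don't already have an agreed list of core tables, one rough way to shortlist candidates is to count how often each table appears in your warehouse's query history. The sketch below assumes a hypothetical export of `(user, table_name)` pairs; the function name `pick_core_tables`, the table names, and the sample data are all made up for illustration.

```python
from collections import Counter


def pick_core_tables(query_log, top_n=20):
    """Return the top_n most-queried table names.

    query_log is an iterable of (user, table_name) pairs, assumed to be
    an export from your warehouse's query history (hypothetical format).
    """
    counts = Counter(table for _user, table in query_log)
    return [table for table, _count in counts.most_common(top_n)]


# Illustrative data only, not real usage numbers.
sample_log = [
    ("ana", "core.rides"),
    ("ben", "core.rides"),
    ("ana", "core.users"),
    ("cho", "marketing.campaigns"),
    ("ben", "core.users"),
    ("cho", "core.rides"),
]

core_tables = pick_core_tables(sample_log, top_n=2)
print(core_tables)
```

Treat the result as a starting point for discussion with the central data team rather than a final list, since raw query counts can be inflated by scheduled jobs.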

Step 2: Populate MVP metadata on these tables.

  • This is where most data catalogs fail, and it is the single biggest reason why. For users to get value out of a catalog, descriptions, tags, owners, etc. need to be curated. However, curating everything by hand isn’t sustainable without an army of data stewards, and the documentation quickly goes out of date. Avoid this pitfall by choosing an automated data catalog for the majority of data, and manually curate only the most impactful data.
  • Where you must, for tribal knowledge, it helps to do a “docs jam session” with a group of data producers and consumers. You can even offer a reward (like a gift card) for those that put in the most docs!
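
Before a docs jam session, it can help to know which of the chosen tables still lack MVP metadata. The following is a minimal sketch, assuming "MVP metadata" means a description, an owner, and at least one tag; the `TableMetadata` class and the field names are hypothetical and not tied to any particular catalog's API.

```python
from dataclasses import dataclass, field


@dataclass
class TableMetadata:
    # Hypothetical record shape; adapt to whatever your catalog exports.
    name: str
    description: str = ""
    owner: str = ""
    tags: list = field(default_factory=list)


# Assumed definition of "MVP metadata" for this sketch.
REQUIRED_FIELDS = ("description", "owner", "tags")


def missing_fields(table):
    """Return the names of required fields that are empty for this table."""
    return [name for name in REQUIRED_FIELDS if not getattr(table, name)]


def coverage_report(tables):
    """Map each incomplete table name to the list of fields it is missing."""
    report = {}
    for table in tables:
        gaps = missing_fields(table)
        if gaps:
            report[table.name] = gaps
    return report


tables = [
    TableMetadata("core.rides", "One row per completed ride", "data-platform", ["core"]),
    TableMetadata("core.users", "", "data-platform", []),
]
report = coverage_report(tables)
print(report)
```

The report doubles as an agenda for the jam session: each entry is a table plus the specific gaps someone needs to fill.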

Step 3: Alpha launch to 5–20 alpha users.

  • It’s best for alpha users to be ultra-vocal users. These will be from the prioritized persona you chose earlier. These users will become the data catalog’s avid supporters when you launch to a broader audience.
  • Incorporate feedback and iterate. Some types of feedback are super valuable here, like when someone says, “Oh, we already have this metadata in this spreadsheet — we should pull that in here, too.”

Step 4: Beta launch to all users of the prioritized persona.

  • It’s important to focus your beta launch on your prioritized users (data consumers, for example). One common mistake is to dilute the focus of your launch by opening up to all personas. That doesn’t mean that you should lock other personas out of the data catalog; it just means that you sequence which personas to focus on first.
  • Graduate to GA if you can meet success metrics targets. More on that in a later section on measuring success.

3. Land for great adoption

In order to get great adoption, here are a few best practices I have seen work:

  • Update Slack channel headers where people ask each other questions. Product features can be super helpful here — for example, if your catalog has Slack integration and can link these conversations to the catalog automatically.
  • Embed into new hire training. Tagging data sets per domain (marketing, growth, etc.) can help new hires quickly onboard to their domains. If you have existing training, showcase the catalog as an entry point. At Lyft, we had all tech new hires instrument a metric during onboarding. They used Lyft’s data catalog for discovering and understanding the right data for that task.
  • Linkages with other products. Create links between various data tools. For example, auto-populate a link between the Airflow DAG that populates a table and the table page in the data catalog (and vice versa). Another impactful link is from the table page in the data catalog to the code that generates the table.
  • Showcase the catalog at a group or company meeting. Deliver a short 5-minute demo at an all-hands meeting that targets persona users. Educate, answer questions, and thank your alpha users. This is super impactful because it creates more awareness and an opportunity to learn.
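
To make the linkage idea concrete, here is a rough sketch of generating two-way links between DAGs and table pages from a mapping of which DAG writes which table. Everything here is an assumption for illustration: the base URLs are placeholders, the URL path patterns are invented and will differ from your actual Airflow and catalog deployments, and the `dag_to_tables` mapping is something you would need to derive from your own pipeline definitions.

```python
from urllib.parse import quote

# Placeholder hosts; replace with your own deployments.
CATALOG_BASE = "https://catalog.example.com"
AIRFLOW_BASE = "https://airflow.example.com"


def catalog_table_url(schema, table):
    # Hypothetical path pattern, not a real catalog route.
    return f"{CATALOG_BASE}/tables/{quote(schema)}/{quote(table)}"


def airflow_dag_url(dag_id):
    # Hypothetical path pattern; check your Airflow version's UI routes.
    return f"{AIRFLOW_BASE}/dags/{quote(dag_id)}"


def build_links(dag_to_tables):
    """Given {dag_id: [(schema, table), ...]}, return links in both directions.

    Returns (dag_links, table_links):
      dag_links   maps dag_id -> catalog URLs of the tables it populates
      table_links maps "schema.table" -> URLs of the DAGs that populate it
    """
    dag_links = {}
    table_links = {}
    for dag_id, tables in dag_to_tables.items():
        for schema, table in tables:
            dag_links.setdefault(dag_id, []).append(catalog_table_url(schema, table))
            table_links.setdefault(f"{schema}.{table}", []).append(airflow_dag_url(dag_id))
    return dag_links, table_links


dag_links, table_links = build_links({"daily_rides": [("core", "rides")]})
print(dag_links)
print(table_links)
```

The first dictionary could feed the DAG's documentation, and the second could feed the catalog's table page, so users can jump between the two tools in either direction.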

4. Measure success

As I said earlier, adoption and customer satisfaction are the only two goals that matter. Below, I dig further into which specific metric definitions to use for each of them:

1. Adoption:

  • WAUs: I’d suggest starting off with Weekly Active Users (WAUs) instead of Daily Active Users or Monthly Active Users. Common usage frequency is weekly, not daily or monthly.
  • Target Penetration rate: 80%. A great penetration rate is 80% within your target persona.
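
As a rough illustration of the two adoption metrics, the sketch below computes WAUs from a list of usage events and then a penetration rate against the target persona. It assumes that penetration rate means the share of target-persona users who were active in the week, and that you can export `(user_id, date)` events from your catalog's usage logs; the function names and sample data are hypothetical.

```python
from datetime import date, timedelta


def weekly_active_users(events, week_start):
    """Return the set of users with at least one event in the 7 days from week_start.

    events is an iterable of (user_id, date) pairs (assumed export format).
    """
    week_end = week_start + timedelta(days=7)
    return {user for user, day in events if week_start <= day < week_end}


def penetration_rate(active_users, persona_users):
    """Share of the target persona that was active (assumed definition)."""
    persona = set(persona_users)
    if not persona:
        return 0.0
    return len(active_users & persona) / len(persona)


# Illustrative data only.
events = [
    ("ana", date(2024, 1, 1)),
    ("ana", date(2024, 1, 3)),
    ("ben", date(2024, 1, 5)),
    ("cho", date(2024, 1, 9)),  # falls in the following week
    ("dev", date(2024, 1, 2)),  # active, but not in the target persona
]
persona = ["ana", "ben", "cho", "eve", "fay"]

wau = weekly_active_users(events, date(2024, 1, 1))
rate = penetration_rate(wau, persona)
print(len(wau), rate)
```

Note that the sample user "dev" counts toward WAUs but not toward penetration, which keeps the penetration number focused on the persona you prioritized.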

2. Customer Satisfaction (CSAT):

  • Measure out of band periodically. In my experience, CSAT feedback collected out of band (not in the product) every 3 or 6 months is better than feedback collected within the data catalog product. I have learned that when feedback is collected in the product, the user’s most recent experience can skew their response.
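
The article does not prescribe a CSAT formula, so the sketch below uses one common convention as an assumption: the percentage of survey respondents who give a rating of 4 or 5 on a 1 to 5 scale. The function name, threshold, and sample ratings are illustrative only.

```python
def csat_score(ratings, satisfied_threshold=4):
    """Percentage of ratings at or above satisfied_threshold.

    Assumes ratings are integers on a 1-5 scale from a periodic,
    out-of-band survey. This is one common convention, not the only one.
    """
    if not ratings:
        raise ValueError("no survey responses to score")
    satisfied = sum(1 for rating in ratings if rating >= satisfied_threshold)
    return 100.0 * satisfied / len(ratings)


# Illustrative survey responses only.
survey = [5, 4, 3, 5, 2, 4, 4, 5]
score = csat_score(survey)
print(score)
```

Whatever definition you choose, keeping it fixed between survey rounds matters more than the exact formula, since the trend across quarters is what you will compare.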

There are a few other metrics that companies often consider: documentation quality, search quality, etc. However, my recommendation is to stick to the core metrics at the outset. As your data catalog matures and you ingest more metadata over time, you can use those more specific metrics to track the impact of each improvement.

I hope this step-by-step guide helps inform you and your team as you navigate your data catalog install and makes your data catalog successful. The right data catalog can greatly reduce the overhead of curation. However, the above steps still play a huge role in ensuring your success, regardless of the data catalog you choose.

Want to learn more about Stemma’s fully managed data catalog? Check out the demo and get started on stemma.ai.
