Stemma: Helping you trust your data

Published in Stemma · 4 min read · Jun 2, 2021

Today, we are excited to announce the launch of Stemma, a fully managed data catalog powered by Amundsen, the leading open-source data catalog with the largest community and broadest adoption. We raised $4.8M in seed funding led by Sequoia to bring the power of Amundsen to every organization.

The problem: Too much data, too little trust

Over the last decade, companies have first captured more data and then made it accessible to more and more people within the company.

Everyone has access to data but few know what exists, what’s trustworthy, and how to use it.

This leads to a huge productivity loss and a big risk for the company. But the problem doesn’t just impact the organization. It’s deeply visceral to data consumers (data analysts, data scientists, and business users) and to data producers (product and data engineers). It impacts them every day.

Analysts and Data Scientists deliver inaccurate reports and models because they inadvertently use the wrong source or incorrect logic. Even worse, the data keeps changing underneath them. Data gets delayed, deprecated, or shut off completely, and analysts and data scientists are the last to find out.

Data Engineers, on the other hand, are constantly bogged down with keeping everyone informed about the current status and upcoming changes to data. Data owners don’t know exactly what and who a change will impact, so they spray and pray. They blast their users with blanket emails that no one reads, let alone remembers.

First attempt: Gossip Protocols

The first attempt to bring sense to this lack of trust is the natural human response — gossip protocols of Slack and shoulder-tapping.

You create a Slack channel #ask-analytics where users ask questions like “What is the source of truth for X?”. It may take a few days to get an answer, but over time you get repeat questions. You wish people searched the Slack channel before they asked the same question over and over again.

It gets worse. Data evolves. And now you wish people would never search the Slack channel, lest they find out-of-date info. You’re thinking about limiting the retention to a couple of weeks.

But it gets worse. Wrong data leads to wrong conclusions: two different departments show two different forecasts for gross shipments during a board meeting, and suddenly everyone realizes this isn’t working, and something must be done!

Second attempt: Curated data catalog

The second attempt is to curate and document information about data: descriptions, experts, dependencies, update frequency, foreign keys, sample queries, and the list goes on. You either document this in a wiki or buy a full-blown product that’s really just a data-aware wiki.

Sometimes you try to get an army of volunteers to enter documentation into this wiki. If you are lucky, you get the first set of docs in, but it starts rotting the day it’s written because writing documentation requires users to leave their existing workflow.

Sometimes, you find someone (aka data steward) whose job it is to ensure this documentation is entered and remains up-to-date. But this doesn’t work because data stewards, while super valuable, don’t have context on all the uses of data, so they end up relying on data experts, bringing us back to the same problem as the army of volunteers.

Curation doesn’t work.

Amundsen — the leading open-source data catalog

After experiencing the faults of gossip protocols and curated data catalogs, I co-created Lyft’s data catalog, Amundsen, to solve the challenge of trust in data through automation. Amundsen is widely adopted at Lyft: it is used by 750 people every week, and 75% of Data Analysts, Data Scientists, and Data Engineers use it weekly. To this day, Amundsen is the highest-ranked Data & Analytics product at Lyft.

Amundsen is the leading open-source data catalog with the largest community and broadest adoption. It’s used by 35+ companies today including Square, Instacart, ING, Brex, Asana, iRobot, and many others. You can join Amundsen’s growing community on Slack and read more about the project here.

Today, we are excited to launch Stemma’s product — bringing the power of Amundsen and more to all of you.

Stemma — bringing the power of Amundsen to you

Stemma builds on top of Amundsen and adds value in two ways: (a) enterprise management, meaning super easy deployment with enterprise-grade security; and (b) intelligence, meaning automated documentation such as common filter and join conditions and related Slack conversations, plus a personalized experience based on the user’s role and activity.
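To make "automated documentation" a little more concrete, here is a minimal sketch of how common join conditions could be surfaced from a query log. This is an illustration only, not how Stemma or Amundsen actually does it: the `QUERY_LOG` sample, the `common_join_conditions` helper, and the regex-based parsing are all assumptions made for the example. A real system would likely use a proper SQL parser and resolve table aliases to table names.

```python
import re
from collections import Counter

# Hypothetical query log; in practice this might come from a warehouse's query history.
QUERY_LOG = [
    "SELECT * FROM rides r JOIN drivers d ON r.driver_id = d.id WHERE r.status = 'completed'",
    "SELECT d.name FROM rides r JOIN drivers d ON d.id = r.driver_id",
    "SELECT * FROM rides r JOIN cities c ON r.city_id = c.id WHERE r.status = 'completed'",
]

# Naive pattern for "ON <alias.column> = <alias.column>"; it only handles simple equality joins.
JOIN_CONDITION = re.compile(r"\bON\s+(\w+\.\w+)\s*=\s*(\w+\.\w+)", re.IGNORECASE)


def common_join_conditions(queries):
    """Count how often each join condition appears, most frequent first."""
    counts = Counter()
    for query in queries:
        for left, right in JOIN_CONDITION.findall(query):
            # Sort both sides so "a.x = b.y" and "b.y = a.x" count as the same condition.
            # Note: aliases are not resolved to table names in this sketch.
            key = " = ".join(sorted((left.lower(), right.lower())))
            counts[key] += 1
    return counts.most_common()


if __name__ == "__main__":
    for condition, count in common_join_conditions(QUERY_LOG):
        print(f"{count}x  {condition}")
```

The point of the sketch is that this kind of documentation is derived from how people already query the data, so nobody has to leave their workflow to write it down.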

We have a lot of exciting stuff coming up and we are going to make Stemma and Amundsen better in a myriad of ways. More on that in a later blog post. Subscribe to our blog to stay in touch. We’re deeply dedicated to making Amundsen the de facto choice for anyone who needs a modern data catalog, and to empowering organizations to harness its power at the press of a button with Stemma.

Dorian and I started Stemma in 2020 to help bring the power of an automated data catalog to the market. Today, we are also super excited to announce our seed funding from Sequoia.

Learn more, see our demo, and get started at stemma.ai.