The data production-consumption gap

All recent innovation in data has taken place in two areas: helping data engineers produce data, and helping data consumers (primarily data analysts and scientists) consume that data. Data warehouses and lakes are flooding with data, but consumers still don't know what exists or what to trust, turning those warehouses and lakes into data swamps.

The biggest gap in data-driven organizations, however, doesn't sit in the production or consumption of data but right between them. Data Engineers continuously report being bombarded by questions from users while striving to deliver data on time and with high quality. Analysts and Data Scientists spend a huge amount of time answering questions about the source of truth for data, how it is typically used, how it gets produced, and validating that it's the right source for their needs. At Lyft, over 30% of analyst time was wasted finding and validating trusted data. This story is not unique to Lyft.

This gap is so huge and untamed that I have decided to leave Lyft to solve this for every organization. Here’s more on why.

How did we get here?

On the consumption side, organizations are democratizing data access to users who previously would have required more expertise. More and more roles that previously weren't data-driven, or had to go through silos to use data, are now using it directly. As organizations become more data-driven, the number of such users is only going to increase. These users are the citizen data scientists. All modern data-driven tech companies — Lyft, Airbnb, Uber, Google — have citizen data scientists. The future is citizen data science.

The ease of producing data and the democratization of data access have led to two new problems that didn't exist before.

Two new problems in the gap

  1. Data Discovery & Trust
  2. Data Governance

1. Data Discovery & Trust

Take a modern data organization like Lyft, for example. When a data scientist is creating a new version of the ETA model, they have to validate its performance against the existing model. At a company like Lyft, there are 50–100, if not more columns related to ETAs, often spread across different data warehouses. Questions like what is the source of truth for ETAs, is it still being populated, how is it calculated, who or what else uses it, how often it gets updated, how does it usually get used, etc. are a big time sink. And these are just the surface level questions not taking into account that the same data can come from different sources (e.g., ETAs from different map providers), mean different things in different contexts (ETA measured before a ride request, during the ride, or actual ETA) or have different usages (ETA displayed to drivers/riders vs being used algorithmically to make decisions). This is a big barrier to entry for citizen data scientists and a huge distraction for Data Engineers.

To discover and trust data, you don’t need perfect data. You need context on the imperfections.

So, what are these imperfections, and what context do users need? Examples range from checking whether a column is still being populated, or whether it has too many nulls or inconsistent values, to more broadly understanding what a table or column contains and how it's generated and used.

How do users find this context on imperfections today? They rely on out-of-date wiki pages, ask around, guess, browse through logs, and run ad hoc queries against the data to find what's trustworthy. Users end up writing and re-running the same ad hoc queries, wasting precious time and resources.
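Those ad hoc checks are usually just a couple of profiling queries, re-typed from scratch each time. Here's a minimal sketch of what they look like; the `rides` table, its columns, and the sample rows are hypothetical, with an in-memory SQLite database standing in for the warehouse:

```python
import sqlite3

# Hypothetical table standing in for a warehouse table about ride ETAs.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rides (eta_seconds REAL, updated_at TEXT)")
conn.executemany(
    "INSERT INTO rides VALUES (?, ?)",
    [(300.0, "2021-01-02"), (None, "2021-01-02"), (420.0, "2021-01-03")],
)

# Check 1: is the column still being populated? (When was the last row written?)
last_update = conn.execute("SELECT MAX(updated_at) FROM rides").fetchone()[0]

# Check 2: what fraction of values are null?
total, nulls = conn.execute(
    "SELECT COUNT(*), SUM(eta_seconds IS NULL) FROM rides"
).fetchone()
null_rate = nulls / total

print(last_update, round(null_rate, 2))  # → 2021-01-03 0.33
```

Every user who hits the table for the first time re-derives this by hand; a metadata platform can run these checks once and surface the answers to everyone.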

Historical efforts have relied on humans. Sometimes, such employees have a specific title — like a data steward. Sometimes, it's just a volunteer army. It never works, because either the users are taken out of their flow to document data sets, or they lack context on all the uses of the data, or both. This documentation, of course, gets out of date.

The key to solving this is a metadata platform that captures this metadata automatically and powers opinionated products based on it. It needs to capture the ABCs of metadata¹:

Application Context — what & where is the data, its shape/stats, etc.

Behavior — who produces & consumes this data (humans and programs)

Change — how has the data and code producing data changed over time?
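To make the ABCs concrete, here is a sketch of what one automatically captured metadata record might hold. The field names and values are illustrative, not Amundsen's actual schema:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ColumnMetadata:
    # Application context: what & where the data is, its shape/stats
    name: str
    table: str
    warehouse: str
    null_fraction: float
    # Behavior: who produces & consumes this data (humans and programs)
    producers: List[str] = field(default_factory=list)
    consumers: List[str] = field(default_factory=list)
    # Change: how the data and the code producing it changed over time
    schema_versions: List[str] = field(default_factory=list)

# A hypothetical record for an ETA column.
eta = ColumnMetadata(
    name="eta_seconds",
    table="rides",
    warehouse="warehouse_prod",
    null_fraction=0.02,
    producers=["etl.rides_daily"],
    consumers=["dashboard.ops", "model.eta_v2"],
    schema_versions=["v1: INTEGER", "v2: REAL"],
)
```

A record like this answers the discovery questions directly: the context fields say what and where the data is, the behavior fields say who to ask about it, and the change history says whether it's still maintained.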

Meanwhile, a new data stack is being established. This stack uses Stitch/Fivetran and Kafka for ingestion, Airflow for orchestration, a data lake for processing, a data warehouse (like Snowflake or BigQuery) for serving, and Looker/Tableau for consumption.

The good news is that this new data stack makes it possible to automatically capture metadata, enabling citizen data scientists to effectively discover and trust data.
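One reason automatic capture becomes possible is that warehouses in this stack expose their own query history, so behavior metadata can be mined from logs rather than asked of humans. A hedged sketch of the idea, with made-up log entries and a deliberately naive regex standing in for a real SQL parser:

```python
import re
from collections import Counter

# Hypothetical query-log entries: (user or program, SQL text).
query_log = [
    ("analyst_a", "SELECT eta_seconds FROM rides WHERE city = 'SF'"),
    ("model.eta_v2", "SELECT * FROM rides JOIN drivers USING (driver_id)"),
    ("analyst_b", "SELECT COUNT(*) FROM drivers"),
]

# Count which tables each user reads — the "Behavior" part of the ABCs.
usage = Counter()
for user, sql in query_log:
    for table in re.findall(r"(?:FROM|JOIN)\s+(\w+)", sql, re.IGNORECASE):
        usage[(table, user)] += 1

consumers_of_rides = sorted(u for (t, u) in usage if t == "rides")
print(consumers_of_rides)  # → ['analyst_a', 'model.eta_v2']
```

No one had to document that `model.eta_v2` depends on `rides`; the log says so, and it stays current as usage changes.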

In order to discover & trust data, you need a metadata platform that automatically captures metadata about the data and provides context into data imperfections.

2. Data Governance

The status quo processes for data governance in organizations are overly manual or consist of coarse-grained blanket policies. As regulations become stricter, like California passing Prop 24 as an extension of CCPA, this status quo becomes untenable.

Remember the data steward and the all-volunteer army? The historical products in this space were built around those concepts. It comes as no surprise that they have terrible NPS scores and low adoption, and are generally reviled by developers and users, because they create more work than they save.

The future of data governance lies in a deeper understanding of:

  • what data an organization has, where it is stored,
  • who accessed it, why, and when.


The good news is that this is the same metadata, described in the previous section, that helps us discover and trust data. We can use the same metadata to help organizations gain a deeper understanding of their data and make it easier for them to protect access to data and comply with regulations. More on this in a future post.

In order to govern data based on ever-changing needs, you need a deeper understanding of what data you have, where it is stored, and who accessed it, why, and when. This is the same metadata that's used to discover and trust data.

Looking ahead

Amundsen² is the first step in bringing a metadata platform to the market. Lyft has an amazing data-driven culture and invested deeply in solving these problems, but they are not unique to Lyft. The more I worked on Amundsen, the clearer it became that these problems are even more severe across diverse enterprise settings.

I have decided to leave Lyft to help solve this problem for every organization out there.

I am grateful to Lyft’s leadership who gave me the opportunity and resources to deeply understand the problem and build Amundsen and the cross-functional Amundsen team who worked tirelessly to make the project what it is today.

I am co-founding Stemma with Dorian to help users discover & trust data, and organizations to safeguard the privacy and security of their data subjects. Stemma and I, personally, are committed to open-source and Amundsen. We will continue to invest time and resources to further Amundsen to provide value for producers and consumers of data through investments in data discovery, lineage, UX, and more.

I am excited about the future and how Amundsen can help organizations become more data-driven while safeguarding the privacy and security of their data subjects.

I’d love to hear your thoughts, feel free to get in touch on Twitter or LinkedIn. Follow Stemma here.

[1]: For a deeper dive into ABCs of metadata, check out the Ground paper by Joe Hellerstein, Vikram Sreekanti et al.

[2]: You can read more about Amundsen in this blog post.

Writer, Engineer, Poet (mark.thegrovers.ca)