The data production-consumption gap


All recent innovation in data has taken place in two areas: helping data engineers produce data, and helping data consumers (primarily analysts and data scientists) consume it. Data warehouses and lakes are flooding with data, but consumers still don't know what exists or what to trust, and so those warehouses and lakes turn into data swamps.

The biggest gap in data-driven organizations, however, doesn't sit in the production or consumption of data but right between them. Data engineers report being continuously bombarded by questions from users while striving to deliver data on time and with high quality. Analysts and data scientists spend a huge amount of time answering questions about the source of truth for a piece of data, how it is typically used, how it gets produced, and whether it's the right source for them to use. At Lyft, over 30% of analyst time was wasted finding and validating trusted data. This story is not unique to Lyft.

This gap is so huge and untamed that I have decided to leave Lyft to solve this for every organization. Here’s more on why.

Innovation in ingestion, processing and storage led to ease of producing raw and derived data. End Result: more data to consume, manage and protect.

On the consumption side, organizations are democratizing data access to users who previously would have required more expertise. More and more roles that previously weren’t data-driven, or had to go through silos to use data, are now using it directly. As more organizations become more data-driven, the number of such users is only going to increase. These users are the citizen data scientists. All modern data-driven tech companies — Lyft, Airbnb, Uber, Google — have citizen data scientists. The future is citizen data science.

The ease of producing data and the democratization of data access together create two new problems that didn't exist in the past:

  1. Data Discovery & Trust
  2. Data Governance

In a world where data is accessed by a larger population of users, you find users who don’t use that data day in, day out. In such cases, the context needed to trust the data is much higher, leading to a huge barrier to entry.

Take a modern data organization like Lyft, for example. When a data scientist creates a new version of the ETA model, they have to validate its performance against the existing model. At a company like Lyft, there are 50–100 columns related to ETAs, if not more, often spread across different data warehouses. Questions like what is the source of truth for ETAs, is it still being populated, how is it calculated, who or what else uses it, and how often it gets updated are a big time sink. And these are just the surface-level questions. They don't account for the fact that the same data can come from different sources (e.g., ETAs from different map providers), mean different things in different contexts (ETA measured before a ride request, during the ride, or the actual ETA), or serve different purposes (ETA displayed to drivers and riders vs. used algorithmically to make decisions). This is a big barrier to entry for citizen data scientists and a huge distraction for data engineers.


To discover and trust data, you don’t need perfect data. You need context on the imperfections.

So, what are these imperfections and their context? They range from narrow checks, such as whether a column is still being populated or has too many nulls or inconsistent values, to broader questions, such as what a table or column contains and how it's generated and used.

How do users find this context on imperfections today? They consult out-of-date wiki pages, ask around, guess, browse through logs, and run ad hoc queries against the data to find what's trustworthy. Users end up writing and re-running the same ad hoc queries, wasting precious time and resources.
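To make those checks concrete, here is a minimal sketch (in Python, with hypothetical column names and a hypothetical freshness threshold) of the null-rate and staleness checks that consumers end up re-deriving by hand:

```python
from datetime import datetime, timedelta

def null_rate(rows, column):
    """Fraction of rows where `column` is missing or None."""
    if not rows:
        return 0.0
    missing = sum(1 for row in rows if row.get(column) is None)
    return missing / len(rows)

def is_stale(last_updated, max_age_hours=24):
    """True if the table has not been written within the freshness window."""
    return datetime.utcnow() - last_updated > timedelta(hours=max_age_hours)

# Hypothetical sample: three ride records, one with a missing ETA.
rows = [
    {"ride_id": 1, "eta_seconds": 300},
    {"ride_id": 2, "eta_seconds": None},
    {"ride_id": 3, "eta_seconds": 240},
]
print(round(null_rate(rows, "eta_seconds"), 2))            # prints 0.33
print(is_stale(datetime.utcnow() - timedelta(hours=48)))   # prints True
```

A metadata platform amortizes exactly this work: such stats are computed once and surfaced next to the table, instead of every consumer re-running the same queries.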

Historical efforts have relied on humans. Sometimes, such employees have a specific title — like a data steward. Sometimes, it’s just a volunteer army. It never works, because either the users are taken out of their flow to document data sets, or they don’t have context on all the uses of data or both. This documentation, of course, gets out of date.

The key to solving this is a metadata platform that captures this metadata automatically and powers opinionated products based on it. It needs to capture the ABCs of metadata¹:

Application Context — what & where is the data, its shape/stats, etc.

Behavior — who produces & consumes this data (humans and programs)

Change — how has the data and code producing data changed over time?
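As a rough illustration of the ABCs (all names here are hypothetical, not Amundsen's actual model), the metadata for a single table could be captured in a record like this:

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional

@dataclass
class TableMetadata:
    # Application context: what the data is and where it lives.
    name: str
    warehouse_uri: str
    row_count: int
    # Behavior: humans and programs that produce or consume the table.
    producers: List[str] = field(default_factory=list)
    consumers: List[str] = field(default_factory=list)
    # Change: when the data and the code producing it last changed.
    last_data_update: Optional[datetime] = None
    producing_code_version: Optional[str] = None  # e.g. a commit hash

# Hypothetical entry for an ETA table.
eta_table = TableMetadata(
    name="rides.eta_actuals",
    warehouse_uri="snowflake://prod/rides/eta_actuals",
    row_count=1_000_000,
    producers=["airflow:compute_eta_actuals"],
    consumers=["dashboard:eta_quality", "model:eta_v2_training"],
)
```

The key point is that every field here can be harvested automatically from query logs, orchestration metadata, and version control, rather than typed in by a data steward.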

Meanwhile, a new data stack is being established. This stack uses Stitch/Fivetran and Kafka for ingestion, Airflow for orchestration, a data lake for processing, a data warehouse (like Snowflake or BigQuery) for serving, and Looker/Tableau for consumption.

The good news is that this new data stack makes it possible to automatically capture metadata, enabling citizen data scientists to effectively discover and trust data.

In order to discover & trust data, you need a metadata platform that automatically captures metadata about the data and provides context into data imperfections.

The second problem with a large user base is safeguarding the security and privacy of the company’s data subjects, and complying with regulations, including but not limited to data protection regulations (e.g. GDPR, CCPA) as well as domain-specific ones (like in healthcare or finance). Moreover, just complying with regulations isn’t enough for large organizations. They carry a higher obligation to maintain their brand and fulfill their social responsibility.

The status quo processes for data governance in organizations are overly manual or consist of coarse-grained blanket policies. As regulations become stricter, like California passing Prop 24 as an extension of CCPA, this status quo becomes untenable.

Remember the data steward and the all-volunteer army? The historical products in this space were built around those concepts. It comes as no surprise that they have terrible NPS scores, low adoption, and are generally reviled by developers and users because they create more work than they solve.

The future of data governance lies in a deeper understanding of:

  • what data an organization has, where it is stored,
  • who accessed it, why, and when.

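To make those questions concrete, here is a minimal sketch (hypothetical names, not a real product's API) of an access audit record that can answer "who accessed it, why, and when":

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Tuple

@dataclass(frozen=True)
class AccessEvent:
    actor: str             # who: a person or a program
    table: str             # what data was accessed
    purpose: str           # why: the declared business purpose
    accessed_at: datetime  # when

# Hypothetical audit log entries.
audit_log = [
    AccessEvent("analyst:jane", "rides.eta_actuals", "model validation",
                datetime(2020, 11, 1, 9, 30)),
    AccessEvent("job:weekly_report", "rides.eta_actuals", "scheduled reporting",
                datetime(2020, 11, 2, 6, 0)),
]

def who_accessed(log: List[AccessEvent], table: str) -> List[Tuple[str, str]]:
    """Governance question: who touched this table, and why?"""
    return [(e.actor, e.purpose) for e in log if e.table == table]
```

Notably, this is the same kind of behavioral metadata that powers discovery; the governance use case just asks different questions of it.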

The good news is that this is the same metadata, described in the previous section, that helps us discover and trust data. We can use it to help organizations gain a deeper understanding of their data, protect access to it, and comply with regulations. More on this in a future post.

In order to govern data based on ever-changing needs, you need a deeper understanding of what data you have, where it is stored, and who accessed it, why, and when. This is the same metadata that's used to discover and trust data.

Looking ahead

We created Amundsen² at Lyft to solve the above two problems — a problem for the individual user (discovering and trusting data) and a problem for the organization (data governance). As of today, Amundsen has 700+ users every week at Lyft and 28 other companies using it, including ING, Square, and Instacart, plus a Slack community of over 1000 members. Huge credit for that goes to the Amundsen team and leadership at Lyft, and many people at dozens of companies contributing to Amundsen. Out of fear of missing someone, I won’t mention names here, but Amundsen is what it is because of them.

Amundsen is the first step in bringing a metadata platform to the market. Lyft has an amazing data-driven culture and invested deeply in solving these problems, but they are not unique to Lyft. The more I worked on Amundsen, the clearer it became that these problems were far more severe across many diverse enterprise settings.

I have decided to leave Lyft to help solve this problem for every organization out there.

I am grateful to Lyft’s leadership who gave me the opportunity and resources to deeply understand the problem and build Amundsen and the cross-functional Amundsen team who worked tirelessly to make the project what it is today.

I am co-founding Stemma with Dorian to help users discover & trust data, and organizations to safeguard the privacy and security of their data subjects. Stemma and I, personally, are committed to open-source and Amundsen. We will continue to invest time and resources to further Amundsen to provide value for producers and consumers of data through investments in data discovery, lineage, UX, and more.

I am excited about the future and how Amundsen can help organizations become more data-driven while safeguarding the privacy and security of their data subjects.

I’d love to hear your thoughts, feel free to get in touch on Twitter or LinkedIn. Follow Stemma here.

[1]: For a deeper dive into ABCs of metadata, check out the Ground paper by Joe Hellerstein, Vikram Sreekanti et al.

[2]: You can read more about Amundsen in this blog post.

Writer, Engineer, Poet (mark.thegrovers.ca)
