Amundsen Monthly Update — June 2021

Mark Grover

Published in amundsen-io, Jul 13, 2021

Summary

June highlights from the Amundsen community:

How Convoy got 80% of its tech org to use Amundsen
Query Intelligence: View common query patterns in Amundsen
Unifying shared dependencies — making life a lot easier for Amundsen devs and installers
Amundsen now uses Mode discovery API
Initial support to use MySQL as backend metadata store
See popular dashboards in Amundsen
Lineage graph visualization

All that and more details below!

Check out last month’s highlights here.

Don’t forget: Join our Slack community at slack.amundsen.io. We can’t wait to meet you!

How Convoy got 80% of its tech org to use Amundsen

Video: How Convoy got 80% of its tech org to use Amundsen

Chad Sanderson and Daniel Dicker, from the Data Platform team at Convoy, joined us at our June community meeting and spoke about their success in implementing and onboarding their data consumers to Amundsen. They dove into the gaps that led them to make investments in Amundsen, their data discovery goals, implementation journey, and where they’re looking to take Amundsen into the future.

Gaps in Data Discovery

In early 2020, the team at Convoy started experiencing issues with data discovery. Sharing information about what was in the data warehouse was time-consuming, and the overhead of updating documentation kept increasing. Many conversations happened in Slack, so information became siloed and reliance on institutional knowledge steadily became the norm. It was difficult for any single person to have a holistic view of the Convoy data model. With Convoy constantly growing and changing, the team knew they had to find a solution that would scale and offer a better data discovery experience.

Data Discovery Goals

Chad, Daniel, and their team had four main goals as they searched for a solution:

Lower the barrier to entry for adding informative, useful information. They wanted to make this as easy as possible for data consumers.
Improve search functionality to enable users to quickly find relevant information.
Add more useful context around the data: Who is using tables? Who are the owners of each table? What queries are being run against tables?
Establish a strong foundation for building additional features as their data discoverability tool grows and changes with the business.

Choosing Amundsen

Chad, Daniel, and their team chose Amundsen for three main reasons: feature set, familiar technologies, and growing community. Since its implementation, the catalog has been a great success — 80% of the Product and Engineering team has used it over the past year, with over 4,500 searches per month.

Check out Chad and Daniel’s presentation to hear more about how they rolled out Amundsen to the broader team and upcoming plans for their data catalog.

Query Intelligence: View common query patterns in Amundsen

Video: Query Intelligence: View common query patterns in Amundsen

We also heard from Grant Seward, founding engineer at Stemma, at our June community meeting. He gave a presentation about query intelligence and how Amundsen can now show common query patterns.

As we think about the data discovery lifecycle, we’ve all had some form of these kinds of questions pop into our minds: What data exists? How do I join these tables together? What is the right way to form these queries? Are there any “gotchas” that exist in the data?

Grant gives a framework on how data discovery provides comprehensive answers to six fundamental questions:

What? Where? Who? When? Why? How?

However, Amundsen has had limited coverage for the question, “How?”

Query intelligence aims to augment our understanding of how data can be used by providing historical examples, which can then be utilized as a starting point during the research phase or referenced during development.

Queries are by far the most common way we interact with data. By indexing query components, we can answer questions such as:

What tables are most commonly joined to this table and how?
Are there any filters (where clauses) that are always applied when accessing this table?
What real-world queries are being used on this table?
What queries are relevant now, given the recent queries that have been executed?

The answers to these questions form the building blocks that provide context about the data before any hands-on exploration. This intelligence will answer the “How?” question, helping you discover how to use your data.
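To make the first question above concrete, here is a minimal sketch of counting which tables are most often joined to a given table. It is not how Amundsen indexes query components; the query log, the table names, and the helper functions `tables_in` and `common_joins` are all hypothetical, and the regex-based parsing is deliberately naive (it ignores subqueries, CTEs, and comma joins).

```python
import re
from collections import Counter

# Hypothetical query log; in practice this would come from the warehouse's query history.
QUERY_LOG = [
    "SELECT * FROM shipments s JOIN carriers c ON s.carrier_id = c.id WHERE s.status = 'active'",
    "SELECT s.id FROM shipments s JOIN carriers c ON s.carrier_id = c.id",
    "SELECT * FROM shipments s JOIN lanes l ON s.lane_id = l.id WHERE s.status = 'active'",
    "SELECT count(*) FROM carriers",
]

# Naive pattern: capture the identifier that follows FROM or JOIN.
TABLE_AFTER_KEYWORD = re.compile(r"\b(from|join)\s+([A-Za-z_][\w.]*)", re.IGNORECASE)


def tables_in(query):
    """Return the set of lowercased table names that follow FROM or JOIN."""
    return {match.group(2).lower() for match in TABLE_AFTER_KEYWORD.finditer(query)}


def common_joins(table, queries):
    """Count how often each other table appears in the same query as `table`."""
    counts = Counter()
    for query in queries:
        tables = tables_in(query)
        if table in tables:
            counts.update(tables - {table})
    return counts


if __name__ == "__main__":
    for other, n in common_joins("shipments", QUERY_LOG).most_common():
        print(f"shipments is joined with {other} in {n} queries")
```

A production version would use a real SQL parser rather than a regular expression, but the counting step would look much the same.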

Check out Grant’s presentation to see a live demo and learn how Amundsen indexes query components.

Developer Experience — Unifying shared dependencies

GitHub pull request for “refactor: shared dependencies unification #1163”

We’ve merged a refactor of dependency management in the monorepo. If you’re building Amundsen packages or Docker images from the monorepo, this change will likely affect you.

Problem

Prior to this change, the Python dependencies were pretty much the same for each proxy. They were hard to keep in sync across all repositories, and as a result, each proxy ran its own version of a given dependency. We ran into many issues involving amundsen-common version mismatches between the frontend (FE) and metadata services.
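The kind of mismatch described above can be illustrated with a small sketch that compares pinned versions across services. The service names, package pins, and the `find_mismatches` helper are hypothetical and chosen only for illustration; they do not reflect the actual contents of the Amundsen requirements files.

```python
from collections import defaultdict

# Hypothetical pinned requirements per service, for illustration only.
SERVICE_REQUIREMENTS = {
    "frontend": ["amundsen-common==0.9.0", "flask==1.1.2"],
    "metadata": ["amundsen-common==0.10.1", "flask==1.1.2"],
    "search": ["flask==1.1.2"],
}


def find_mismatches(requirements_by_service):
    """Return packages that are pinned to different versions in different services."""
    versions = defaultdict(dict)
    for service, lines in requirements_by_service.items():
        for line in lines:
            name, sep, version = line.partition("==")
            if sep:  # only consider exact pins of the form name==version
                versions[name.strip().lower()][service] = version.strip()
    return {
        package: pins
        for package, pins in versions.items()
        if len(set(pins.values())) > 1
    }


if __name__ == "__main__":
    for package, pins in find_mismatches(SERVICE_REQUIREMENTS).items():
        print(f"{package} is pinned inconsistently: {pins}")
```

With a single shared requirements file, a check like this is no longer needed, because there is only one pin per package.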

Solution

By unifying the dependencies shared across multiple services (both core and testing dependencies), all packages are now declared in one place, which keeps dependency versions consistent across services. This also makes it possible to open a single PR that updates multiple packages.

What can you expect from this change?

Installation from source: when installing from source, use “pip3 install -e .” to ensure dependencies are pulled in properly.
Building Docker images: we moved Docker images to the root directory of the Amundsen repo.
Makefile actions were adjusted for this change: run “make install_deps” to install from source, or “make image” to build the Docker image. The same goes for all docker-compose files available in the monorepo.

Huge shoutout to Mariusz Gorski and the community for this refactor! You can see more details of the implementation here.

Migration to Mode discovery API

The Amundsen Mode dashboard integration has now been migrated to the new, more performant Mode discovery API. 🎉 This will greatly improve the experience for those using the Mode dashboard integration.

Big thanks to Junda Yang from Brex and Neha Hystad from Mode for the support.

Initial support to use MySQL as backend metadata store

GitHub pull request for “feat: support mysql in metadata service #1182”

Thanks to contributions from WePay’s Xuan Shen, Amundsen now has initial support for using MySQL as the backend metadata store! This will allow you to use existing MySQL infrastructure as Amundsen’s backend without having to rely on a graph database. See more details here.

Xuan Shen gave a detailed presentation at our July community meeting. We’ll share more details in our next monthly update. ✨

Popular Dashboards

You can now see popular dashboards in Amundsen. You have always been able to see popular tables — we gave the same love to dashboards. BTW, popular here means most queried/most viewed dashboards, not popular based on Amundsen activity.
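As a rough sketch of this definition of popularity, the snippet below ranks dashboards by how often they were viewed in the dashboard tool itself. The event format, the dashboard names, and the `popular_dashboards` helper are hypothetical; this is not Amundsen’s ranking code.

```python
from collections import Counter

# Hypothetical view events from the dashboard tool: (dashboard_id, viewer) pairs.
VIEW_EVENTS = [
    ("revenue_overview", "alice"),
    ("revenue_overview", "bob"),
    ("revenue_overview", "carol"),
    ("carrier_health", "alice"),
    ("carrier_health", "dave"),
    ("lane_pricing", "bob"),
]


def popular_dashboards(view_events, top_n=3):
    """Return up to `top_n` dashboard ids, most viewed first."""
    counts = Counter(dashboard for dashboard, _viewer in view_events)
    return [dashboard for dashboard, _count in counts.most_common(top_n)]


if __name__ == "__main__":
    print(popular_dashboards(VIEW_EVENTS, top_n=2))
```

Note that the input is usage in the dashboard tool, not page views inside Amundsen, which matches the distinction drawn above.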

Thanks to Verdan Mahmood for his contributions here. 🙌

Lineage Graph Visualization

As part of Amundsen’s commitment to building in lineage exploration, we completed two additional iterations on the graphical visualization UI for table lineage. We brought in Atlas proxy support as well as multiple stability improvements following direct community feedback. Our next step is a brief period of aggregating community use cases and shaping the future of the lineage feature UX.

Huge props to Boyan Bonev for spearheading these UI/UX iterations!

Announcements

📣 We’d like to welcome Junda Yang, Amundsen’s latest committer (Junda’s GitHub). Junda was an early engineering member on the Data Tools team at Lyft working on Amundsen. He’s recently done a fantastic job integrating Amundsen with Mode dashboard and its latest discovery API. Looking forward to more contributions from Junda in the future! Amundsen is a Linux Foundation project, with an open governance model. Adding new committers like Junda, and growing our dev community, is an important, ongoing part of our project.

📣 We have deprecated the amundsen-atlas-types repository. Up until now, we maintained a separate package for creating proper entity types in Atlas in order to facilitate integration between Atlas and Amundsen. We are keeping the approach of creating those entity types, but we are moving the code directly into the Amundsen repo. Moving forward, the amundsen databuilder repo should be used instead of amundsen-atlas-types. Please see the tutorial for more details.

📣 We have a new version of flaskoidc. The new version available on PyPI is 1.0.2 and is backward incompatible. With this new version, we’ve moved away from flask-oidc and now rely on Authlib, so there’s no need to maintain a separate client_secrets.json file anymore. Configuration is cleaner and requires only a few environment variables to enable e2e OIDC support. The access token is not being refreshed at the moment; this will be fixed in future releases. More details on how to configure this new version are on GitHub.

Coming up next…

ML Feature Discovery in Amundsen

A few months ago, we presented the first version of ML feature discovery, in which Mariusz Strzelecki, from GetInData, built V1 of ML feature discovery and its integration with Feast. More recently, during our July community meeting, Allison Suarez Miranda, from Lyft, shared what the next version of ML feature discovery in Amundsen looks like.

Next community meeting

Date: Thursday, August 5, 9am Pacific, 12pm Eastern, 6pm Central Europe
Add to your calendar: https://evt.to/huuaiauw

Join us on Slack: slack.amundsen.io

Subscribe for periodic updates: Medium & Twitter

Curated with ❤ by Stemma