Tools in the Data Management Zoo

August 15, 2020

There are many metadata management tools and solutions. Some specialize in data discovery, search, and lineage; others focus on business processes. Which data management tool is the best? What features are must-have and nice-to-have in a data catalog?

This post is a part of Data Governance From an Engineering Perspective, a series of posts about Data Governance and Metadata.

Introduction
1. Data Governance From an Engineering Perspective
2. The Alter Ego of Data
3. Tools in the Data Management Zoo (this post)
4. With Data Comes Responsibility
Physical system
Data models
Business processes & Compliance

It’s about time to give you more details and present the available solutions. As there are many different tools with unique approaches, I decided to call it The Data Management Zoo.

Unless you have been living under a rock, you have probably heard of data catalogs or business glossaries.

I am overwhelmed by the sheer number of available tools and the differences between them. Some of the questions I have:

Which data management tool is the best?
What features are must-have and nice-to-have?
Which tool is best for a cloud data platform?
What is the price?
Can I build my own solution?
…

There are so many questions and so few answers…

Metadata tool categories

In Gartner’s latest reports, you will find metadata management tools split into standalone and embedded tools.

Such a separation makes sense. Yet, I would add one more option. Based on my experience and input from my readers, many teams decide to build custom solutions.

1. Standalone tools

Some solutions provide “complete” data governance and metadata management.

Informatica package (Enterprise Data Catalog, Axon, Data Quality)
Collibra
Alation
and other offerings

Open source community has two big projects:

Considerations: Standalone tools are only as powerful as their crawlers and connectors. Also, integrating them into your data ecosystem and business processes is not straightforward.

2. Embedded tools

Increased customer interest in metadata makes it a lucrative market for vendors. New projects appear. As a result, data platform components include metadata management as a feature.

For example (tools with built-in metadata features):

Data preparation (Trifacta, Alteryx, Ab Initio, Talend, and more)
Data virtualization (Dremio, Denodo, …)
Data integration (Qlik Integration, …)
Access, policy and security (Privacera, Immuta, Okera)
BI & analytics platforms (SAS, Tableau, …)
Data lake enablement tools (Cloudera Navigator, Kylo, …)
Cloud data catalogs (Azure Data Catalog, AWS Glue, GCP Data Catalog)

Considerations: To achieve end-to-end governance, all data has to “flow through” the embedded solutions. As a result, you might use the tools in a wider scope than you planned and increase lock-in. Otherwise, you end up with metadata silos.

3. Custom implementation

Use open source projects as a basis and build your solution around them
Keep it simple and focus on implementing must-have features first
Look for native metadata support in your existing stack (database, visualization tools, ETL)
Define a way of working and apply your data processing frameworks consistently

Consideration: As always with a custom implementation, you might end up reinventing the wheel.
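
To make the "native metadata support" point concrete, here is a minimal sketch in Python. It assumes SQLite as a stand-in for your database and reads table and column metadata from SQLite's own catalog (sqlite_master and PRAGMA table_info). The function names, the demo tables, and the shape of the resulting dictionary are illustrative assumptions, not part of any tool mentioned in this post.

```python
import sqlite3


def harvest_table_metadata(conn):
    """Return {table_name: [column dicts]} read from SQLite's built-in catalog."""
    catalog = {}
    tables = conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name"
    ).fetchall()
    for (table_name,) in tables:
        # PRAGMA table_info rows are: cid, name, type, notnull, dflt_value, pk
        columns = conn.execute(f'PRAGMA table_info("{table_name}")').fetchall()
        catalog[table_name] = [
            {
                "name": col[1],
                "type": col[2],
                "not_null": bool(col[3]),
                "primary_key": bool(col[5]),
            }
            for col in columns
        ]
    return catalog


def search(catalog, term):
    """Return sorted 'table.column' names where the table or column matches term."""
    term = term.lower()
    return sorted(
        f"{table}.{col['name']}"
        for table, cols in catalog.items()
        for col in cols
        if term in table.lower() or term in col["name"].lower()
    )


def build_demo_catalog():
    """Create two hypothetical tables in memory and harvest their metadata."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT NOT NULL)"
        )
        conn.execute(
            "CREATE TABLE orders "
            "(id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
        )
        return harvest_table_metadata(conn)
    finally:
        conn.close()


if __name__ == "__main__":
    print(search(build_demo_catalog(), "customer"))
```

Most databases expose a comparable catalog (for example an information schema), so a must-have "search my columns" feature can often start as a few queries rather than a new platform.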

“Hello world” with data governance

Far too often, I see consultants recommending big, standalone solutions by default. Meanwhile, open-source metadata projects get swept under the rug.

I don’t agree with such an approach for three reasons:

From a business perspective, clarify your requirements before you compare the tools.
From an engineering perspective, installing and integrating an off-the-shelf product is boring.
Also, I tend to prefer incremental delivery over a big bang approach.

Instead of deconstructing which metadata tool is best, build your own MVP.

Source: https://github.com/danistefanovic/build-your-own-x

First, start with a clean sheet

Štefan Urbánek, a former Facebook engineer, gave a talk about the importance of metadata and architecture.

He presented the following approach:

1. Pick a metadata problem
2. Use a spreadsheet (users already have Excel or Google Sheets)
3. Suffer through the document-exchange phase
4. Use a functional approach to metadata composition and application
99. (later) Move spreadsheets into a metadata repository
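
One possible reading of steps 2 and 4, sketched in Python: metadata lives in a spreadsheet export, pure functions compose several metadata layers, and another pure function applies the result to data. This is an interpretation, not the code from the talk. The CSV columns, the layer names, and the masking rule are all hypothetical.

```python
import csv
import io
from functools import reduce

# Hypothetical spreadsheet export (e.g. "Download as CSV"); the columns are assumptions.
SHEET_CSV = """dataset,column,description,pii
orders,customer_email,Contact email of the buyer,yes
orders,total,Order total in EUR,no
"""


def load_sheet(text):
    """Parse the CSV export into {(dataset, column): {field: value}}."""
    rows = csv.DictReader(io.StringIO(text))
    return {
        (row["dataset"], row["column"]): {
            "description": row["description"],
            "pii": row["pii"] == "yes",
        }
        for row in rows
    }


def compose(*layers):
    """Merge metadata layers left to right without mutating any of them.

    Later layers override earlier ones field by field.
    """
    def merge(acc, layer):
        keys = acc.keys() | layer.keys()
        return {key: {**acc.get(key, {}), **layer.get(key, {})} for key in keys}

    return reduce(merge, layers, {})


def mask_pii(metadata, dataset, record):
    """Apply metadata to a record: return a copy with PII columns masked."""
    return {
        column: "***" if metadata.get((dataset, column), {}).get("pii") else value
        for column, value in record.items()
    }


# Layer 1: business metadata maintained by users in the spreadsheet.
business = load_sheet(SHEET_CSV)
# Layer 2: technical metadata, here written by hand for illustration.
technical = {
    ("orders", "customer_email"): {"type": "string"},
    ("orders", "total"): {"type": "decimal"},
}

metadata = compose(business, technical)
masked = mask_pii(metadata, "orders", {"customer_email": "jane@example.com", "total": 42.5})

if __name__ == "__main__":
    print(masked)
```

Because the layers are plain data and the functions have no side effects, moving from spreadsheets to a metadata repository later only changes where the layers are loaded from.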

Second, use open source projects as a foundation

“We’ve used Marquez as a starting point and easily extended it to fit our needs such as enforcing security policies as well as changes to its domain language. If you’re looking for a small and simple tool to bootstrap […] Marquez is a good place to start.” - ThoughtWorks Technology Radar Vol.22

Source: https://www.thoughtworks.com/radar/platforms/marquez

Marquez is a fresh, platform-agnostic open source project led by Julien Le Dem. Julien is one of the shapers of the “Big Data” landscape: he coauthored Apache Parquet and contributed to Apache Arrow.

Marquez is still in the early stages of development, so running it in production might be risky.

Instead, look at the project as an educational journey. Figure out what you need and what is missing. Then decide on the next steps, or even switch to a commercial tool.

Other open source alternatives

DataHub
- Metadata search & discovery tool
- A project from LinkedIn, released in Feb 2020. There are still major differences between the internal version and the open-sourced one.
Amundsen
- Data discovery & metadata engine
- A project from Lyft. Amundsen and Marquez joined LF AI as Incubation Projects.
Metacat
- Metadata exploration API service
- Created by Netflix.
Apache Atlas
- Metadata management and governance.
- Proven solution in many Hadoop data governance battles
ODPi Egeria
- Metadata and governance system

Summary

There are many metadata management tools and solutions. Some specialize in data discovery, search, and lineage; others focus on business processes.

What frustrates me as an engineer is how difficult it is to test and explore some commercial tools. You have to reach out to sales reps to get access, instead of pulling a Docker image and running it.

If you want to start small and learn as you go, give open-source tools a chance. Take a look at Marquez. It is a rather small but powerful metadata management project. Use it as a starting point. Or look at some of the alternatives, like Amundsen or DataHub.