The Alter Ego of Data

August 08, 2020

3 min read

Metadata is simply data about data. It is a description and context. In my opinion, metadata is as important as the actual contents. How can you possibly trust the contents if you don’t know who collected it, how and why?

This post is a part of Data Governance From an Engineering Perspective, a series of posts about Data Governance and Metadata.

We are here:

Introduction
1. Data Governance From an Engineering Perspective
2. The Alter Ego of Data (this post)
3. Tools in the Data Management Zoo
4. With Data Comes Responsibility
Physical system
Data models
Business processes & Compliance

Data collection

We collect more data than ever before. No doubt about that. A wide range of tools allows us to capture, store and analyze items that 20 years ago were not considered (documents, videos, pictures, sound). Yet how accurate is the data?

Have you ever wondered what the difference is between a fact and the stored data?

How does a transaction or a customer profile differ from reality?

Instead of getting into philosophical discussions, let me say it straight away - we need context for data to be meaningful.

Source: https://www.memecenter.com/

You can see context as data’s ecosystem: a common vocabulary and a set of relationships between components. The more we know about that ecosystem, the better we can interpret the data within it.

For example, you have a low-grade fever - 37.3 C. There is a chance you caught a cold virus and you should rest.

Knowing that you met a person infected with COVID-19 a few days earlier changes the context. Now you have to isolate yourself.
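The fever example can be sketched in a few lines of Python. This is only a toy illustration of how the same value leads to a different conclusion once context is attached; it is not medical guidance. The names `Reading`, `recommend` and the threshold constant are hypothetical, and the threshold simply reuses the 37.3 C figure from the example.

```python
from dataclasses import dataclass

# Toy threshold taken from the example in the text, not a clinical definition.
LOW_GRADE_FEVER_C = 37.3


@dataclass
class Reading:
    temperature_c: float          # the data
    recent_covid_contact: bool    # the context around the data


def recommend(reading: Reading) -> str:
    """Return a recommendation based on the value AND its context."""
    if reading.temperature_c < LOW_GRADE_FEVER_C:
        return "no action"
    if reading.recent_covid_contact:
        return "isolate"
    return "rest"


# Same temperature, different context, different interpretation.
without_context = recommend(Reading(37.3, recent_covid_contact=False))
with_context = recommend(Reading(37.3, recent_covid_contact=True))
```

The temperature alone cannot tell the two cases apart; only the extra field does.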

Another example - troubleshooting the infamous blue screen of death.

[Image: blue screen of death]

Source: https://www.flickr.com/photos/9704498@N05/3383841393/

Looking at the error message alone won’t solve the problem. To find the root cause of the error, you want to look into:

Windows logs
Running processes
Memory dumps
User activity
Hardware logs

Gathering surrounding data and making a timeline of events shows potential causes. The more context and metadata you get, the more meaning you extract.
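As a minimal sketch of the timeline idea, the snippet below merges events from several sources into one chronological list. The `Event` class, the `build_timeline` helper and all sample events are made up for illustration; real Windows or hardware logs would first need to be parsed into this shape.

```python
from dataclasses import dataclass
from datetime import datetime
from itertools import chain


@dataclass(frozen=True)
class Event:
    timestamp: datetime
    source: str    # where the event came from (part of its metadata)
    message: str


def build_timeline(*sources):
    """Combine events from all sources and order them by timestamp."""
    return sorted(chain(*sources), key=lambda event: event.timestamp)


# Hypothetical sample events, invented purely for this example.
windows_log = [Event(datetime(2020, 8, 8, 10, 0, 5), "windows_log", "system crash reported")]
hardware_log = [Event(datetime(2020, 8, 8, 10, 0, 1), "hardware_log", "memory error detected")]
user_activity = [Event(datetime(2020, 8, 8, 10, 0, 3), "user_activity", "video call started")]

timeline = build_timeline(windows_log, hardware_log, user_activity)
for event in timeline:
    print(event.timestamp.isoformat(), event.source, event.message)
```

Each event keeps its `source`, so the merged view still shows where every piece of evidence came from.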

Data’s unanswered questions

Metadata is data about data. It is a description and context.

Metadata is as important as the actual contents. How can you trust the contents if you don’t know who collected it, how and why?

Look at the sales table below:

[Image: sales table]

We know the answers to some of the questions:

How much did John pay for a bike? Answer: 800 (euros, dollars, kroner?)
How many T-shirts were sold on the 10th of August? Answer: 5
How much revenue did sales generate on the 9th of August? Answer: 24.20
You get the point… :)

While other questions are impossible to answer:

Developer: What systems will I affect if I rename “Total price” column to “total_price”?
Business person: What is the definition of “Product” column? Does it correspond to an internal product hierarchy?
Data architect: What sensitive customer data does it contain?
Auditor: How was the “Total price” calculated? Does it contain tax?
Data Warehouse Architect: What is the source system and how often does it refresh?
Security engineer: Who has access to the data?
Business person #2: Who is the product owner?
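One way to picture what would answer these questions is a small metadata record kept next to the table. The sketch below is an assumption, not a description of any particular catalog tool: the class names, the fields, the "Customer" column and every value (owner, source system, refresh schedule, downstream systems) are hypothetical placeholders. Only the "Total price" and "Product" column names come from the questions above.

```python
from dataclasses import dataclass, field


@dataclass
class ColumnMetadata:
    name: str
    definition: str
    sensitive: bool = False
    used_by: list = field(default_factory=list)  # downstream systems reading this column


@dataclass
class TableMetadata:
    name: str
    owner: str
    source_system: str
    refresh_schedule: str
    readers: list
    columns: list


# All values below are invented for illustration.
sales = TableMetadata(
    name="sales",
    owner="sales analytics team",
    source_system="webshop order database",
    refresh_schedule="daily",
    readers=["analysts", "finance"],
    columns=[
        ColumnMetadata("Total price", "Amount charged per order line, tax included",
                       used_by=["revenue dashboard", "finance export"]),
        ColumnMetadata("Product", "Product name as shown in the webshop",
                       used_by=["revenue dashboard"]),
        ColumnMetadata("Customer", "Name of the buyer", sensitive=True),
    ],
)


def systems_affected_by_rename(table, column_name):
    """Developer's question: what breaks if this column is renamed?"""
    for column in table.columns:
        if column.name == column_name:
            return column.used_by
    raise KeyError(column_name)


def sensitive_columns(table):
    """Data architect's question: which columns hold sensitive data?"""
    return [column.name for column in table.columns if column.sensitive]


affected = systems_affected_by_rename(sales, "Total price")
sensitive = sensitive_columns(sales)
```

With such a record in place, the owner, source system, refresh schedule and access list become simple lookups (`sales.owner`, `sales.source_system`, `sales.refresh_schedule`, `sales.readers`) instead of open questions.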