Data Platform School

The Alter Ego of Data

By Valdas Maksimavicius
Published in Data Governance
August 08, 2020
3 min read

Metadata is simply data about data. It is a description and context. In my opinion, metadata is as important as the actual contents. How can you possibly trust the contents if you don’t know who collected them, how, and why?


This post is a part of Data Governance From an Engineering Perspective, a series of posts about Data Governance and Metadata.

We are here:

  1. Introduction
    1. Data Governance From an Engineering Perspective
    2. The Alter Ego of Data (this post)
    3. Tools in the Data Management Zoo
    4. With Data Comes Responsibility
  2. Physical system
  3. Data models
  4. Business processes & Compliance

Data collection

We collect more data than ever before. No doubt about that. A wide range of tools allows us to capture, store and analyze items that were not even considered 20 years ago (documents, videos, pictures, sound). Yet how accurate is the data?

Have you ever wondered what the difference is between a fact and the data stored about it?

How does a transaction or a customer profile differ from reality?

Instead of getting into philosophical discussions, let me say it straight away - we need context for data to be meaningful.

[Image: meme. Source: https://www.memecenter.com/]

You can think of context as the data’s ecosystem: a common vocabulary and a set of relationships between components. The more we know about that ecosystem, the better we can interpret the data within it.

For example, say you have a low-grade fever of 37.3 °C. There is a chance you caught a cold virus and should rest.

If you met a COVID-19 infected person a few days earlier, the context changes. Now you have to isolate yourself.

Another example - troubleshooting the infamous blue screen of death.

[Image: blue screen of death]

Source: https://www.flickr.com/photos/9704498@N05/3383841393/

Looking at the error message alone won’t solve the problem. To find the root cause of the error, you want to look into:

  • Windows logs
  • Running processes
  • Memory dumps
  • User activity
  • Hardware logs

Gathering surrounding data and making a timeline of events shows potential causes. The more context and metadata you get, the more meaning you extract.
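
The timeline idea can be sketched in a few lines of Python. This is only an illustration: the `Event` type, the `build_timeline` helper and the sample events are all made up for this example, and real logs would need parsing first.

```python
from datetime import datetime
from typing import Iterable, List, NamedTuple


class Event(NamedTuple):
    timestamp: datetime
    source: str   # e.g. "Windows logs", "Hardware logs"
    message: str


def build_timeline(*sources: Iterable[Event]) -> List[Event]:
    """Merge events from every source into one list ordered by time."""
    merged = [event for source in sources for event in source]
    return sorted(merged, key=lambda event: event.timestamp)


# Hypothetical sample events, invented for illustration only.
windows_logs = [
    Event(datetime(2020, 8, 8, 10, 15), "Windows logs", "Driver update installed"),
    Event(datetime(2020, 8, 8, 10, 42), "Windows logs", "System crash"),
]
hardware_logs = [
    Event(datetime(2020, 8, 8, 10, 40), "Hardware logs", "GPU temperature warning"),
]

timeline = build_timeline(windows_logs, hardware_logs)
for event in timeline:
    print(f"{event.timestamp:%H:%M}  [{event.source}]  {event.message}")
```

Each source on its own says little. Interleaved by time, the events start to suggest a sequence worth investigating.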

Data’s unanswered questions

Metadata is data about data. It is a description and context. 

Metadata is as important as the actual contents. How can you trust the contents if you don’t know who collected them, how, and why?

Look at the sales table below:

[Image: sales table]

We know the answers to some of the questions:

  • How much did John pay for a bike? Answer: 800 (euros, dollars, kroner?)
  • How many T-shirts were sold on the 10th of August? Answer: 5
  • How much revenue did sales generate on the 9th of August? Answer: 24.20
  • You get the point… :)

While other questions are impossible to answer:

  • Developer: What systems will I affect if I rename “Total price” column to “total_price”?
  • Business person: What is the definition of “Product” column? Does it correspond to an internal product hierarchy?
  • Data architect: What sensitive customer data does it contain?
  • Auditor: How was the “Total price” calculated? Does it contain tax?
  • Data Warehouse Architect: What is the source system and how often does it refresh?
  • Security engineer: Who has access to the data?
  • Business person #2: Who is the product owner?

Technical metadata example

Assume there is a system table with data types and descriptions.

[Image: system table with column data types and descriptions]

Now, we know a bit more about our sales table: 

  • Column Data types - specified during table creation (possibility to use schema discovery)
  • Column Description - populated by users
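
To make the two bullet points concrete, here is a minimal Python sketch that uses an in-memory SQLite database as a stand-in for a real system catalog. The data types come from schema discovery (`PRAGMA table_info`), while the descriptions come from a user-maintained dictionary. Apart from “Product” and “Total price”, the column names, the types and the description texts are assumptions made up for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical sales table; only "Product" and "Total price" are named in this post.
conn.execute(
    """
    CREATE TABLE sales (
        "Customer"    TEXT,
        "Product"     TEXT,
        "Quantity"    INTEGER,
        "Total price" REAL
    )
    """
)

# Column descriptions populated by users (wording invented for illustration).
descriptions = {
    "Product": "Name of the product sold",
    "Total price": "Amount charged for the order line",
}


def describe_table(connection, table, column_descriptions):
    """Return (column, data type, description) for each column of the table."""
    rows = connection.execute(f"PRAGMA table_info({table})").fetchall()
    # Each PRAGMA row is (cid, name, type, notnull, dflt_value, pk).
    return [(row[1], row[2], column_descriptions.get(row[1], "")) for row in rows]


for column, data_type, description in describe_table(conn, "sales", descriptions):
    print(f"{column:<12} {data_type:<8} {description or '(no description yet)'}")
```

The split mirrors the list above: the database already knows the types, but a description exists only if a person writes it.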

Next, assume there is table level information.

[Image: table-level metadata]

How to put it in place?

Suppose we use Azure SQL or SQL Server as our database. Both have a useful feature called “Extended properties”.

Extended properties let you store additional information about database objects. Think of them as comments on:

  • databases
  • tables
  • views
  • triggers
  • constraints
  • columns
  • stored procedures
  • functions
  • indexes.
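
As a rough sketch, the Python helper below builds the T-SQL call that attaches a description to a column through the `sys.sp_addextendedproperty` system procedure. The helper names (`quote`, `add_column_description`), the `dbo.Sales` table and the description text are assumptions made for illustration. `MS_Description` is a conventional property name, not a required one. The script only prints the statement, so you would still run it through your own database client.

```python
def quote(value: str) -> str:
    """Render a Python string as a T-SQL Unicode literal, doubling single quotes."""
    return "N'" + value.replace("'", "''") + "'"


def add_column_description(schema: str, table: str, column: str, text: str) -> str:
    """Build an sp_addextendedproperty call that describes a single column."""
    return (
        "EXEC sys.sp_addextendedproperty\n"
        f"    @name = N'MS_Description', @value = {quote(text)},\n"
        f"    @level0type = N'SCHEMA', @level0name = {quote(schema)},\n"
        f"    @level1type = N'TABLE',  @level1name = {quote(table)},\n"
        f"    @level2type = N'COLUMN', @level2name = {quote(column)};"
    )


# Hypothetical schema, table and description, used only as an example.
statement = add_column_description(
    "dbo", "Sales", "Total price", "Amount charged for the order line"
)
print(statement)
```

Once stored, the properties can be read back from the `sys.extended_properties` catalog view, which makes them queryable like any other data.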

For now, remember that each row in your business table carries hidden data.

For more examples, see: https://dataedo.com/kb/data-glossary/what-is-metadata

Metadata problems

As I mentioned earlier, the more context and metadata you get, the more meaning you extract. 

Storing all the surrounding data has drawbacks too:

  • Violation of data privacy and the right to be forgotten
  • Storage space isn’t infinite (less of a problem with cloud)
  • Complex and different formats, structures 

Also, as the series focuses on an engineering perspective, here are a few technical concerns:

  • How do we keep metadata in sync and up to date?
  • Is there a user friendly UI or API to manage metadata?
  • Who populates it - technical teams or business users?
  • I use X database and Y ETL tool - do I need a custom-built solution?
  • Source systems don’t have column descriptions. Should we populate them manually?
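
The first concern, keeping metadata in sync, can at least be detected automatically. Below is a minimal Python sketch, assuming you can list the columns that actually exist and the descriptions you have stored. The function name and the sample data are hypothetical. The example imagines that “Total price” was renamed to “total_price” without anyone updating the documentation.

```python
def find_metadata_drift(actual_columns, documented):
    """Compare the columns that exist with the columns that have descriptions."""
    actual = set(actual_columns)
    described = {name for name, text in documented.items() if text.strip()}
    return {
        "undocumented": sorted(actual - described),  # exist, but have no description
        "stale": sorted(set(documented) - actual),   # described, but no longer exist
    }


# Hypothetical state after a rename that nobody recorded in the metadata.
actual_columns = ["Customer", "Product", "Quantity", "total_price"]
documented = {
    "Product": "Name of the product sold",
    "Total price": "Amount charged for the order line",
}

drift = find_metadata_drift(actual_columns, documented)
print("Missing descriptions:", drift["undocumented"])
print("Stale descriptions:  ", drift["stale"])
```

A check like this does not say who should fix the gap, but running it on a schedule at least shows when the metadata and the data have drifted apart.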

“To understand recursion metadata, you must first understand recursion metadata!”

