The Alter Ego of Data



This post is part of Data Governance From an Engineering Perspective, a series of posts about Data Governance and Metadata.

We are here:

  1. Introduction
    1. Data Governance From an Engineering Perspective
    2. The Alter Ego of Data (this post)
    3. Tools in the Data Management Zoo
    4. With Data Comes Responsibility
  2. Physical system
  3. Data models
  4. Business processes & Compliance

Data collection

We collect more data than ever before. No doubt about that. A wide range of tools allows us to capture, store and analyze items that were not even considered 20 years ago (documents, videos, pictures, sound). Yet how accurate is that data?

Have you ever wondered what the difference is between a fact and the data stored about it?

How does a transaction or a customer profile differ from reality?

Instead of getting into philosophical discussions, let me say it straight away - we need context for data to be meaningful.

Source: https://www.memecenter.com/

You can see context as data's ecosystem: a common vocabulary and a set of relationships between components. The more we know about that ecosystem, the better we can interpret the data within it.

For example, say you have a low-grade fever of 37.3 °C. Chances are you caught a cold and should rest.

Knowing that you met a COVID-19 infected person a few days earlier changes the context: now you have to isolate yourself.

Another example - troubleshooting the infamous blue screen of death.

Source: https://www.flickr.com/photos/9704498@N05/3383841393/

Looking at the error message alone won't solve the problem. To find the root cause of the error, you want to look into:

  • Windows logs
  • Running processes
  • Memory dumps
  • User activity
  • Hardware logs

Gathering the surrounding data and building a timeline of events reveals potential causes. The more context and metadata you get, the more meaning you extract.
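To make the timeline idea concrete, here is a minimal Python sketch. It assumes each source (Windows logs, hardware logs, user activity) has already been parsed into time-sorted records. The `Event` class, the `build_timeline` helper and all sample records are invented for illustration and are not taken from any real log.

```python
from dataclasses import dataclass
from datetime import datetime
from heapq import merge


@dataclass(frozen=True)
class Event:
    timestamp: datetime
    source: str   # where the record came from, e.g. "windows_log"
    message: str


def build_timeline(*sources):
    """Merge several time-sorted event lists into one chronological list."""
    # heapq.merge expects every input to be sorted by the same key already.
    return list(merge(*sources, key=lambda event: event.timestamp))


# Hypothetical sample records, invented for illustration only.
windows_log = [
    Event(datetime(2020, 8, 10, 9, 0, 5), "windows_log", "Driver update installed"),
    Event(datetime(2020, 8, 10, 9, 14, 30), "windows_log", "Bugcheck recorded"),
]
hardware_log = [
    Event(datetime(2020, 8, 10, 9, 14, 10), "hardware_log", "Memory error reported"),
]
user_activity = [
    Event(datetime(2020, 8, 10, 9, 12, 0), "user_activity", "Video editor started"),
]

timeline = build_timeline(windows_log, hardware_log, user_activity)
for event in timeline:
    print(event.timestamp.isoformat(), event.source, event.message)
```

Each record on its own says little. Placed in order next to records from the other sources, it starts to suggest which events preceded the crash.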

    Data's unanswered questions

    Metadata is data about data. It is a description and context. 

    Metadata is as important as the actual contents. How can you trust the contents if you don't know who collected them, how, and why?

    Look at the sales table below:

    We know the answers to some of the questions:

    • How much did John pay for a bike? Answer: 800 (euros, dollars, kroner?)
    • How many T-shirts were sold on the 10th of August? Answer: 5
    • How much revenue did sales generate on the 9th of August? Answer: 24.20
    • You get the point... :)

    While other questions are impossible to answer:

    • Developer: What systems will I affect if I rename the "Total price" column to "total_price"?
    • Business person: What is the definition of the "Product" column? Does it correspond to an internal product hierarchy?
    • Data architect: What sensitive customer data does it contain?
    • Auditor: How was the "Total price" calculated? Does it contain tax?
    • Data Warehouse Architect: What is the source system and how often does it refresh?
    • Security engineer: Who has access to the data?
    • Business person #2: Who is the product owner?

    Technical metadata example

    Assume there is a system table with data types and descriptions.

    Now, we know a bit more about our sales table: 

    • Column data types - specified during table creation (or inferred through schema discovery)
    • Column descriptions - populated by users
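A rough sketch of how these two kinds of column metadata can be combined, using Python's built-in `sqlite3` module as a stand-in database. Only the "Product" and "Total price" column names come from this post. The other columns, the description texts and the `describe_table` helper are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table layout; data types are fixed at table creation.
conn.execute(
    'CREATE TABLE sales ("Date" TEXT, "Customer" TEXT, "Product" TEXT, '
    '"Quantity" INTEGER, "Total price" REAL)'
)

# Descriptions are written by people, so some columns may have none yet.
descriptions = {
    "Product": "Name of the product sold (hypothetical wording)",
    "Total price": "Amount charged for the order line (hypothetical wording)",
}


def describe_table(connection, table, column_descriptions):
    """Combine discovered data types with user-written descriptions."""
    # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk).
    # The table name is interpolated, so pass trusted names only.
    rows = connection.execute(f"PRAGMA table_info({table})").fetchall()
    return [
        {
            "column": name,
            "data_type": data_type,
            "description": column_descriptions.get(name, ""),
        }
        for _cid, name, data_type, *_rest in rows
    ]


metadata = describe_table(conn, "sales", descriptions)
for entry in metadata:
    print(entry)
```

The data types are discovered from the database itself, while the descriptions stay empty until someone fills them in.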

    Next, assume there is table level information.
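As an illustration only, table-level information could be modelled as a small record whose fields mirror the questions raised earlier (owner, source system, refresh frequency, access, sensitive columns). The `TableMetadata` class, its field names and the sample values below are hypothetical and do not describe a real catalog.

```python
from dataclasses import dataclass, field, fields


@dataclass
class TableMetadata:
    """Table-level facts that the rows themselves cannot tell us."""
    table: str
    owner: str = ""
    source_system: str = ""
    refresh_frequency: str = ""
    readers: list = field(default_factory=list)
    sensitive_columns: list = field(default_factory=list)


def unanswered(metadata):
    """Return the names of the fields nobody has filled in yet."""
    return [f.name for f in fields(metadata) if not getattr(metadata, f.name)]


# Hypothetical values, invented for illustration only.
sales_metadata = TableMetadata(
    table="sales",
    owner="Sales team",
    refresh_frequency="daily",
)

print(unanswered(sales_metadata))
```

Listing the empty fields is a cheap way to see which questions about a table still have no answer.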

    How to put it in place?

    Suppose we use Azure SQL or SQL Server as our database. Both have a useful feature called "Extended properties".

    Extended properties let you store additional information about database objects. Think of them as comments on:

    • databases
    • tables
    • views
    • triggers
    • constraints
    • columns
    • stored procedures
    • functions
    • indexes.
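As a sketch, the Python helper below builds the T-SQL call that attaches a description to a single column through the `sys.sp_addextendedproperty` system stored procedure. `MS_Description` is the property name conventionally used for descriptions. The `dbo` schema, the `Sales` table name, the description text and the helper functions are assumptions for the example. The code only builds the statement text and does not connect to a database.

```python
def quote_literal(value):
    """Render a string as a T-SQL Unicode literal, doubling single quotes."""
    return "N'" + value.replace("'", "''") + "'"


def add_column_description(schema, table, column, description):
    """Build an EXEC statement for sys.sp_addextendedproperty on one column."""
    params = [
        ("@name", "MS_Description"),
        ("@value", description),
        ("@level0type", "SCHEMA"), ("@level0name", schema),
        ("@level1type", "TABLE"), ("@level1name", table),
        ("@level2type", "COLUMN"), ("@level2name", column),
    ]
    args = ",\n    ".join(
        f"{name} = {quote_literal(value)}" for name, value in params
    )
    return f"EXEC sys.sp_addextendedproperty\n    {args};"


# Hypothetical schema, table and description, for illustration only.
statement = add_column_description(
    "dbo", "Sales", "Total price", "Amount charged for the order line"
)
print(statement)
```

When running against a real database, passing the values as query parameters through your driver is usually safer than building the statement as a string.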

    For now, remember that each row in your business table has hidden data:

    Check out more examples at: https://dataedo.com/kb/data-glossary/what-is-metadata

    Metadata problems

    As I mentioned earlier, the more context and metadata you get, the more meaning you extract. 

    Storing all the surrounding data has drawbacks too:

    • Violation of data privacy and the right to be forgotten
    • Storage space isn't infinite (less of a problem with cloud)
    • Complex and varied formats and structures

    Also, as the series focuses on an engineering perspective, here are a few technical concerns:

    • How do we keep metadata in sync and up to date?
    • Is there a user friendly UI or API to manage metadata?
    • Who populates it - technical teams or business users?
    • I use X database and Y ETL tool - do I need a custom-built solution?
    • Source systems don't have column descriptions. Should we populate them manually?
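One way to approach the sync question is a simple drift check that compares the columns a table actually has with the columns that have documentation. This is a minimal sketch. The `find_metadata_drift` helper and the sample column lists are hypothetical, and a real setup would read both sides from the database and the metadata store.

```python
def find_metadata_drift(actual_columns, documented_columns):
    """Compare the columns that exist with the columns that are documented."""
    actual = set(actual_columns)
    documented = set(documented_columns)
    return {
        # Present in the table, but nobody has described them yet.
        "undocumented": sorted(actual - documented),
        # Described, but no longer present (for example after a rename).
        "stale": sorted(documented - actual),
    }


# Hypothetical state after renaming "Total price" to "total_price".
actual_columns = ["Date", "Customer", "Product", "Quantity", "total_price"]
documented_columns = ["Product", "Total price"]

drift = find_metadata_drift(actual_columns, documented_columns)
print(drift)
```

Run on a schedule or in a deployment pipeline, a check like this flags descriptions that went stale after a schema change.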

    "To understand recursion metadata, you must first understand recursion metadata!"


    About author

    Hi! I am Valdas Maksimavičius. I specialize in data analytics and cloud computing with ten years of experience. I have been using Azure Cloud components since 2014.

    For the last five years, I have been leading Data Engineering teams using the latest Azure Data and AI services. I worked on Data Lake and Data Science platform implementations for various sectors in the Nordics. Check out my personal blog.


    I plan to release other posts in the future. If you like the topics, sign up to get notified about new posts.

    Any feedback, opinions and suggestions are highly welcome!
