Picture the scene: you arrive at work bright and early, switch on your PC and settle down to do some important work, confident that the data you need will be ready and waiting for you. But I have a question for you: do you know how that data gets there and where it comes from?
Over the years I have come across numerous situations that run something along these lines:
Team A: Our data is loaded up by IT
IT: No, we don't touch that data; it's a manual data load by Team B
Team B: We just send the spreadsheet to Team A - we're sure that they load the data
Team A: No, we really don't load up that data…
A friend and colleague of mine, Justin York, has coined the phrase "data fairies" for precisely such situations, and it describes them accurately. For many people this is the reality: whilst they may not in fact believe in the data fairies, they do not know how they get their data or sometimes even where it comes from. Sadly, they often do not care.
I was lucky enough to hear Peter Aiken speak earlier this year and he summed up the situation perfectly with the following analogy: most people consider data in the same way that they think of air, in that they don’t consider it at all; they just assume that it will always be there and will always be good enough to use. It’s only when something goes wrong that we consider the quality of the data that we are using. And I would add to that: even then, do we think about where the data comes from?
All too often I find that the answer to this question is no. Having been forced to consider the quality of the data, since it is not good enough to use, the usual reaction is to go for a tactical fix of the data in front of you. After all, if you don’t know where the data comes from or how it got there, how can you consider a more permanent, strategic solution? Unfortunately, this typical attitude starts a whole cycle of constant tactical data cleanses or fixes that soon become part of the usual way of working. I think that anyone who has the opportunity to step back and consider this would agree that the situation makes no sense at all, but if you don’t know where your data comes from, how do you go about finding the source of the problem and fixing it?
This is why data lineage (knowing where the data comes from and what happens to it on its journey to you) is so important. It allows a problem to be traced back to its origin, so that a more permanent solution can be considered if it is appropriate and cost effective to do so.
Unfortunately, documenting data lineage for a whole company from scratch is not a small task, and unless there is a regulatory requirement for it (such as Solvency II), you are unlikely to find support for a wholesale project to investigate and document data lineage. Instead, I recommend an iterative approach to slowly build up a repository of data lineage information. Use what you find out when investigating data quality issues and document it for reuse in the future. Adding to such a repository over time will get your data lineage started, and using it to resolve issues more easily may help you build a business case for documenting your company’s most critical data items in the future.
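For the more technically minded, here is a rough sketch of what such a repository might look like at its very simplest. It is only an illustration under my own assumptions: the names Hop, LineageRepository, record, trace and origins are made up for this example, as are the "shared drive folder" and "overnight load script" in the sample data, and a spreadsheet or a data catalogue tool could do the same job. The idea is simply that each investigation records the hops it uncovers, and the next investigation can walk them back towards the origin.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Hop:
    """One documented step in a data item's journey (illustrative fields)."""
    source: str   # where the data came from
    target: str   # where it landed
    process: str  # what moved or changed it
    owner: str    # who is responsible for this step


@dataclass
class LineageRepository:
    """A hypothetical in-memory store of hops, built up one issue at a time."""
    hops: list[Hop] = field(default_factory=list)

    def record(self, hop: Hop) -> None:
        """Add a hop found during an investigation, skipping exact duplicates."""
        if hop not in self.hops:
            self.hops.append(hop)

    def trace(self, item: str) -> list[Hop]:
        """Walk upstream from `item`, nearest hop first."""
        path, seen, frontier = [], set(), [item]
        while frontier:
            current = frontier.pop(0)
            for hop in self.hops:
                # `seen` stops us looping forever if the hops form a cycle.
                if hop.target == current and hop not in seen:
                    seen.add(hop)
                    path.append(hop)
                    frontier.append(hop.source)
        return path

    def origins(self, item: str) -> set[str]:
        """Sources with no documented upstream hop: where to look next."""
        traced = self.trace(item)
        if not traced:
            return {item}
        targets = {hop.target for hop in traced}
        return {hop.source for hop in traced if hop.source not in targets}


# Made-up hops, loosely based on the Team A / Team B conversation above.
repo = LineageRepository()
repo.record(Hop("Team B spreadsheet", "shared drive folder",
                "emailed and saved manually", "Team B"))
repo.record(Hop("shared drive folder", "Team A report",
                "overnight load script", "IT"))

for hop in repo.trace("Team A report"):
    print(f"{hop.target} <- {hop.source} ({hop.process}, owner: {hop.owner})")
```

Even something this small answers the two questions that matter when data goes wrong: what happened to it on the way, and who to talk to about each step.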
So, upon reflection, and in keeping with the true traditions of a good fairy tale, it occurs to me that my story should also have a subtitle. Therefore, I do hope you have enjoyed reading “Do you believe in the Data Fairies (or the Importance of Data Lineage)”.
So, boys and girls, I do hope you will join me in a few weeks’ time for my next story!