Do You Believe in the Data Fairies?
This is an updated version of a blog post I first wrote back in August 2012. Fourteen years on, I'm genuinely surprised by how little has changed, and by how much more urgent the problem has become.
Picture the scene: you arrive at work, open your laptop, and settle down to do some important work, confident that the data you need will be ready and waiting for you. But here's my question: do you know how that data gets there, and where it comes from?
Over the years I've come across countless situations that run something like this:
Team A: Our data is loaded up by IT.
IT: No, we don't touch that data — it's a manual load by Team B.
Team B: We just send the spreadsheet to Team A. We're sure they load it.
Team A: No, we really don't load that data…
My friend and colleague Justin York coined the phrase "data fairies" for precisely these situations, and it describes them perfectly. For many people, the data simply appears, and whilst they may not literally believe in the data fairies, they have no idea how they get their data, or sometimes even where it comes from. Sadly, they often don't care either.
I was lucky enough to hear Peter Aiken speak years ago, and he summed the situation up with a brilliant analogy: most people think about data the same way they think about air. They don't think about it at all. They just assume it will always be there, and always be good enough to use. It's only when something goes wrong that they stop to consider the quality of what they're working with.
And I'd add: even then, do they think about where the data comes from?
All too often, the answer is no. Having been forced to acknowledge a data quality problem, the usual reaction is a tactical fix of the data in front of you. After all, if you don't know where the data came from or how it got there, how could you possibly consider a more permanent solution? And so begins a cycle of constant data cleanses and patches that quietly become part of the normal way of working — nobody questions them, they just become "the process."
This is why data lineage (knowing where your data comes from and what happens to it on its journey to you) matters so much. It allows problems to be traced back to their source, and permanent solutions to be properly considered.
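To make "traced back to their source" a little more concrete, here is a minimal Python sketch. It assumes lineage is held as a simple map from each dataset to the datasets it is built from. Every name in it (UPSTREAM, trace_sources, the dataset names) is made up for illustration and does not come from any particular tool.

```python
# Hypothetical lineage map: each dataset lists the datasets it is built from.
# All names are illustrative assumptions, not taken from a real system.
UPSTREAM = {
    "sales_report": ["sales_mart"],
    "sales_mart": ["crm_extract", "team_b_spreadsheet"],
    "crm_extract": [],
    "team_b_spreadsheet": [],
}


def trace_sources(dataset, upstream=UPSTREAM):
    """Return the original sources feeding `dataset`, sorted by name.

    A source is any dataset with no recorded upstream. A dataset that is
    missing from the map is also treated as a source, because nothing
    further is known about it.
    """
    sources = set()
    seen = set()
    stack = [dataset]
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # already visited; also guards against cycles
        seen.add(current)
        parents = upstream.get(current, [])
        if not parents:
            sources.add(current)
        stack.extend(parents)
    return sorted(sources)


print(trace_sources("sales_report"))
# prints ['crm_extract', 'team_b_spreadsheet']
```

Even a map this crude answers the question Team A, Team B and IT could not: when the sales report looks wrong, which original sources should someone go and check?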
Now here's where 2026 changes everything.
When I wrote this post in 2012, the stakes of not knowing your data lineage were significant but manageable. Poor decisions, regulatory headaches, wasted hours fixing bad reports. Not ideal, but survivable.
In the age of Gen AI, the stakes are categorically higher.
Organisations are now feeding data into LLMs to drive automated decisions about customers, risk, and operations. If you don't know where that data comes from, you cannot know whether your AI is trustworthy. Garbage in, garbage out has always been true, but when the "out" is an automated lending decision, a medical recommendation, or a fraud flag, the consequences are very different from a slightly wrong internal report.
There's a cruel irony here too. Many organisations are turning to AI to help with data quality and data lineage. It can genuinely provide some help for these activities, but AI cannot magic good data governance into existence. It can help you document lineage faster, flag anomalies more quickly, and surface patterns a human might miss. What it cannot do is substitute for knowing, at a fundamental level, what your data is and who is responsible for it.
The data fairies are not just still with us; they're now feeding the models.
My advice remains the same as it was in 2012, but it's more urgent: take an iterative approach to building up your data lineage. Use every data quality incident as an opportunity to document what you learn. Build a repository, piece by piece. And, critically, make sure your AI governance and your data governance are not two separate conversations, because they cannot be.
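As a rough illustration of building that repository piece by piece, the Python sketch below records one hop of lineage each time an incident teaches you something. It also keeps a list of datasets that still have no known owner. The class, method and dataset names are all hypothetical. A real repository would more likely live in a data catalogue or a shared document than in code.

```python
from dataclasses import dataclass, field


@dataclass
class LineageRepository:
    """A deliberately tiny, hypothetical lineage repository."""

    # (source, target) -> short note on how the data moves between them
    hops: dict = field(default_factory=dict)
    # dataset name -> owner, or None while the owner is still unknown
    owners: dict = field(default_factory=dict)

    def record_incident(self, source, target, how, owner=None):
        """Document one hop learned while investigating an incident."""
        self.hops[(source, target)] = how
        if owner is not None:
            self.owners[source] = owner
        else:
            # Register the source without overwriting a known owner
            self.owners.setdefault(source, None)
        self.owners.setdefault(target, None)

    def unowned(self):
        """Datasets seen so far that nobody has claimed responsibility for."""
        return sorted(name for name, owner in self.owners.items() if owner is None)


repo = LineageRepository()
repo.record_incident(
    "team_b_spreadsheet", "sales_mart",
    how="emailed weekly, loaded by hand", owner="Team B",
)
repo.record_incident("sales_mart", "sales_report", how="nightly refresh")

print(repo.unowned())
# prints ['sales_mart', 'sales_report']
```

The unowned list is the useful part: it turns "who is responsible for this data?" from an awkward question in a meeting into a short, visible to-do list that shrinks with each incident.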
The data fairies aren't going away on their own. In the age of Gen AI, understanding your data has never mattered more. My book Effective Data Governance (Kogan Page, 2026) gives you the framework to tackle all of this. It's available now.
