In survey after survey, about half of IT executives consistently agree that data quality and data consistency are among the biggest roadblocks to getting full value from their data.
This has been consistently true all the way back to Greek and Roman days. I suspect it will be true in 100 years.
Similarly, many experts estimate that HALF the money spent on developers goes towards "software repair".
So we're living in a world of sick software and dirty data. A fine mess we've gotten ourselves into. And the cost of all this is staggering.
I've long been a proponent of rapid, iterative, and continuous testing. This concept holds true for both data and applications.
This rapid, iterative, continuous testing model has measurably improved the quality of software development. Evangelists such as Kent Beck have had a huge impact on this. I recently posted a freely downloadable white paper on this topic.
But where are the evangelists for data quality? Where is an open source "JUnit for Data" and if it's out there, why isn't everyone using it?
Data should be tested as they are added or integrated. Profiling is a simple, fast, relatively easy-to-implement, and highly effective way to eliminate significant volumes of defective data.
When developers write a new application that accepts new data, it's normal for input fields to be "validated" - a simple, hard-coded form of profiling. A month number needs to be between 1 and 12. Not rocket science. And it's universally done. Something like the sketch below.
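Just to make the point concrete, here is a minimal illustration in Python of that kind of hard-coded check (the function name and error messages are my own, not from any particular framework):

```python
# Hypothetical illustration: the hard-coded field validation most input
# forms do without a second thought.
def validate_month(value: str) -> int:
    """Reject anything that isn't a month number between 1 and 12."""
    month = int(value)  # raises ValueError on non-numeric input
    if not 1 <= month <= 12:
        raise ValueError(f"month out of range: {month}")
    return month
```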
Yet people have far fewer reservations about integrating data from here, there and everywhere - often not checking for even the most egregious data errors, and thereby polluting the organizational drinking water.
Data profiling engines are a great technology for quickly improving the quality of data as it is integrated from one system into another. At the highest level, a profiling engine scans data and applies easily definable rules to data elements - formats, ranges, allowable values - and can also evaluate relationships between different fields.
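Commercial engines all differ, but as a rough sketch of the idea (the field names and rules here are my own illustration, not any vendor's API), a rule set covering formats, ranges, allowed values, and a cross-field relationship might look like this:

```python
import re

# Illustrative rule set: format, range, allowed values, and a relationship
# between two fields. Names are hypothetical, not a vendor API.
RULES = {
    "zip_code":  lambda r: bool(re.fullmatch(r"\d{5}", r["zip_code"])),   # format
    "order_qty": lambda r: 0 < int(r["order_qty"]) < 10_000,              # range
    "status":    lambda r: r["status"] in {"OPEN", "CLOSED", "PENDING"},  # allowed values
    "ship_date": lambda r: r["ship_date"] >= r["order_date"],             # cross-field check
}

def profile_record(record: dict) -> list[str]:
    """Return the names of every rule this record violates."""
    failures = []
    for name, rule in RULES.items():
        try:
            if not rule(record):
                failures.append(name)
        except (KeyError, ValueError):
            # A missing or unparseable field counts as a failure too.
            failures.append(name)
    return failures
```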
Furthermore, these engines can also analyze existing data stores very rapidly and generate "exception files" for manual or semi-automated remediation (if anyone can find a totally automated data remediation system, I'd love to know about it). So they can be used in "continuous testing" or "batch testing" mode.
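In batch mode, the same rules can simply be pointed at an existing store. A rough sketch, reusing the profile_record function above (the CSV format and file names are just for illustration):

```python
import csv

def batch_profile(input_path: str, exceptions_path: str) -> None:
    """Scan an existing data store (here, a CSV file) and write every
    failing record, plus the rules it broke, to an exception file for
    manual or semi-automated remediation."""
    with open(input_path, newline="") as src, \
         open(exceptions_path, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.writer(dst)
        writer.writerow(["record", "failed_rules"])
        for record in reader:
            failures = profile_record(record)  # rules from the sketch above
            if failures:
                writer.writerow([record, ";".join(failures)])
```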
I've never understood why these engines haven't been more popular. There is no "JUnit for data" as far as I know. But commercial solutions are available - they're not terribly expensive and rapidly pay for themselves.
By the same token, I've never understood why organizations are so tolerant of bad, dirty data. They waste millions directly because of it (and untold sums in missed opportunities), yet are reluctant to spend $15,000 to help fix a significant portion of it.