
Integration on the Edge: Data Explosion & Next-Gen Integration

Hollis Tibbetts

Fixing a $3 Trillion Dirty Data Problem with "Crowd Computing"


Yes.  "Crowd Computing", not "Cloud Computing".  That wasn't a typo. 

Bad data is a big problem. An enormously expensive problem. At this point, most people should be nodding their heads - just as if I had said something obvious and true, like "the world is round" or "political advertisements on television are irritating, especially during a campaign year".

And technology has been incredibly ineffective at addressing this dirty data problem.

But there's a new weapon in your arsenal to fight this problem - and you might not be aware of it.  It's brand new - a technological advance.  It's called "people" - sometimes called "the crowd".   

It's actually a very innovative way to use technology to harness human beings as a way of solving problems that ONLY human beings can solve. 

It's a scalable and elastic approach to using people.  It's somewhat like applying Cloud Computing principles to human beings - maybe it should be called "Crowd Computing".  And it may be exactly what you need.

How Bad Is the Problem?

In survey after survey, about half of IT executives consistently cite data quality and data consistency as one of the biggest roadblocks to getting full value from their data.

This has been true ever since the Chinese invented the abacus. I suspect it will still be true long after quantum computing has solved every other problem humanity faces.

Incorrect, inconsistent, fraudulent and redundant data cost the U.S. economy over $3 trillion a year - an astounding figure, more than twice the size of the 2011 U.S. federal deficit.

Using Technology to Attack Bad Data

I've worked with companies over the years to help them chip away at the bad data problem. Yet the problem continues to get worse, because of the ever-increasing number of systems out there and the staggering growth in the amount of corporate data - typically about 40% a year.

I've written about techniques and best practices for ameliorating the bad data situation with technology - data profiling engines, data matching, data de-duplication algorithms and so on.
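
To give a concrete flavor of what "data matching" looks like in practice, here is a minimal Python sketch - purely illustrative, using only the standard library's difflib module rather than any commercial matching engine, and with a made-up similarity threshold - of how a de-duplication pass flags records that look suspiciously alike:

# Minimal de-duplication sketch. Real data-quality tools use tunable,
# multi-field matching algorithms; this only compares names with difflib.
from difflib import SequenceMatcher

records = [
    {"id": 1, "name": "Acme Corporation"},
    {"id": 2, "name": "Acme Corp"},
    {"id": 3, "name": "Globex Industries"},
]

def similarity(a, b):
    # Return a 0.0-1.0 similarity score for two strings (case-insensitive).
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Flag any pair of records whose names look suspiciously alike.
# The 0.7 threshold is an illustrative assumption, not a recommended setting.
for i, left in enumerate(records):
    for right in records[i + 1:]:
        score = similarity(left["name"], right["name"])
        if score > 0.7:
            print(f"Possible duplicate: record {left['id']} vs {right['id']} (score {score:.2f})")

Notice what even this toy example leaves open: it can tell you two records look alike, but it can't tell you which one is correct.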

Technology CAN'T Solve Bad Data

In discussions with clients, conversations about using technology to resolve bad data inevitably lead to the same point: "...and THEN what do I do???"

Let me attempt to reconstruct a typical conversation between a "data consultant" and a "CIO with a data problem" (note: for entertainment purposes, I've added some color, but conceptually, the conversation is highly accurate).

Data Consultant: I've got some technology that uses fancy tunable algorithms to score your data. Run your data through this nifty software magic black box and it will magically sort your data into three piles:

1) Highly compliant and clean (as far as these tests can determine);

2) Definitely non-compliant and almost certainly dirty;

3) Not really sure... probably compliant, but not really sure.

Customer with Bad Data: Can you give me an idea of how big those three piles of data will be?

Confident Data Consultant: Well, that depends on how you tune the algorithms, and how bad your data are. Your mileage may vary tremendously. But don't be surprised if 70% of your data get scored as clean, and 30% get scored as dirty, duplicate or "unsure".

Concerned Customer: And then what does your magic software box do with the 30%?

Backpedalling Consultant: Well, nothing. It just creates the piles. But there are certain things we can do to automatically remediate some of the data.

Very Concerned Customer: And how much will that fix things?

Timid Consultant: Oh, that depends on a lot of things - really impossible to say, but maybe a quarter to as much as two-thirds... if I had to make a guess. The rest will require manual intervention.

Agitated Customer: But that leaves 10 to 25% that's "unfixable". I have a hundred thousand (or a million, or 50 million) items that need validation and cleaning. There's absolutely NO WAY I can manually check that amount of data!

Scared Consultant: Well, look on the bright side: at least you know what data are good and what are "who knows". Doesn't that make you a LOT better off than you were before?

Outraged Customer: No, you idiot... now I'm worse off. You're trying to convince me to spend a lot of money to identify a problem that I can't fix. Worse yet, I'll be putting a spotlight on the problem! The CEO will blame me. And when I can't fix it, I'll get fired. I'd rather not do anything. Get out of my office now!

Unemployed Consultant: Well, it was great meeting you, and I'll call you next week to discuss "next steps".

Why Technology Can't Do the Job

The problem with technology is that it's good enough to put a spotlight on the bad data problem, but not good enough to actually resolve the problem.  

Data are valuable.  You can't tell some algorithm that at a confidence score of 49.999% a particular data item should be erased and at 50.000% a data item should be declared "clean".

Companies typically want very high levels of confidence in a data item for it to be declared "clean".  And if a data item is to be automatically eliminated, the level of confidence should be exceptionally high.  When it comes to automated remediation via software, confidence levels must also be extremely high. 

Given the state of technology today, these issues just cannot be fully addressed by software alone.  They need human intervention. 
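
To make that "needs human intervention" point concrete, here is a deliberately simplified sketch - the thresholds and record names are my own illustrative assumptions, not anyone's product defaults - of how confidence-scored records get triaged into the three piles from the conversation above, with the middle pile landing on a human:

# Hypothetical triage of confidence-scored records into the three "piles"
# described earlier. The 0.95 / 0.05 thresholds are illustrative assumptions;
# companies typically demand very high confidence before trusting automation,
# which is exactly why the middle pile ends up needing human review.
AUTO_CLEAN_THRESHOLD = 0.95   # declare "clean" only above this score
AUTO_REJECT_THRESHOLD = 0.05  # auto-flag as dirty only below this score

def triage(scored_records):
    clean, dirty, needs_human = [], [], []
    for record, confidence in scored_records:
        if confidence >= AUTO_CLEAN_THRESHOLD:
            clean.append(record)
        elif confidence <= AUTO_REJECT_THRESHOLD:
            dirty.append(record)
        else:
            needs_human.append(record)  # no safe automated decision here
    return clean, dirty, needs_human

sample = [("rec-001", 0.99), ("rec-002", 0.50), ("rec-003", 0.02)]
clean, dirty, needs_human = triage(sample)
print(len(clean), "clean;", len(dirty), "dirty;", len(needs_human), "for human review")

No matter where you set those thresholds, something falls into that middle bucket - and software has nothing left to offer it.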

Granted, IT departments are full of humans, but these aren't humans who are available for doing data judgment and remediation work.  And even if they were - a handful of IT people can't fix 100,000 (or a million) questionable data items.  And even if they could, they wouldn't be happy about being forced to do so. 

Human Crowd Computing

Just as Amazon has developed a software and hardware infrastructure (EC2) for instantly available, incredibly scalable, elastic, "pay for what you use" computing power, a new category of solution infrastructure is emerging that does the same - using people instead of CPUs.

It involves a very large internet-connected workforce - many thousands to millions of people - whom I will refer to as "contributors". It also requires a software platform that enables this crowd of "opt-in" contributors to work in parallel on simple, repetitive, non-specialized tasks that are easy for humans but hard for computers.

The right design - with automated oversight, multiple reviewers for each task, and automated scoring/rating of individual contributors - ensures highly accurate results.
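
As a rough illustration of how that design can work - the contributor names, accuracy figures and weighting scheme below are assumptions for demonstration, not a description of any specific platform - each questionable item is judged by several contributors, and their answers are combined by a vote weighted by each contributor's track record on known test questions:

# Illustrative sketch of redundant review with trust weighting: several
# contributors judge the same item, and each vote is weighted by that
# contributor's historical accuracy on known test questions.
# All names and numbers here are assumptions for demonstration purposes.
from collections import defaultdict

contributor_accuracy = {"alice": 0.96, "bob": 0.80, "carol": 0.90}

# Three independent judgments of the same questionable record.
judgments = [("alice", "valid"), ("bob", "invalid"), ("carol", "valid")]

def aggregate(judgments, accuracy):
    # Return the answer with the highest accuracy-weighted vote total.
    weights = defaultdict(float)
    for contributor, answer in judgments:
        weights[answer] += accuracy[contributor]
    return max(weights, key=weights.get)

print(aggregate(judgments, contributor_accuracy))  # -> valid

Scale that pattern out to thousands of contributors working in parallel, and you get the elasticity that the EC2 analogy promises - applied to judgment calls instead of CPU cycles.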

A Credible Solution for a Big Problem

In many cases, this model of "massively scalable and elastic opt-in workers" plus a sophisticated software platform should be the preferred solution for dirty data issues. In other cases, it is the ideal "last mile of the marathon" for solving data issues in conjunction with some of the software tools and techniques (like data profiling) that I've written about in other articles.

If you have data issues, you absolutely should be investigating this Crowd Computing technology. 

On Wednesday, I'll be publishing a companion article to this one. It drills down into the "use cases" where human crowd computing is highly valuable in solving data issues.

More importantly, I'll review some real-life success metrics from a large-scale deployment of crowd computing at a Fortune 100 company - using the CrowdFlower Enterprise Crowdsourcing Platform.  San Francisco-based CrowdFlower is the leader in this new and growing space, with a community of some 2 million opt-in contributors.


