Data problems - whether they be inaccurate data, incomplete data, data categorization issues, duplicate data, data in need of enrichment - are age-old.
IT executives consistently agree that data quality and data consistency are among the biggest roadblocks to getting full value from their data. In today's information-driven businesses, the issue is more critical than ever.
Technology, however, has not done much to help us solve the problem - in fact, it has accelerated the creation of mountains of "bad data" while doing very little to help organizations clean it up.
One "technology" holds much promise in helping organizations mitigate this issue - crowdsourcing. I put the word technology in quotation marks - as it's really people that solve the problem, but it's an underlying technology layer that makes it accurate, scalable, distributed, connectable, elastic and fast. In an article earlier this week, I referred to it as "Crowd Computing".
Crowd Computing - for Data Problems
The Human "Crowd Computing" model is an ideal approach for newly entered data that needs to either be validated or enriched in near-realtime, or for existing data that needs to be cleansed, validated, de-duplicated and enriched. Typical data issues where this model is applicable include:
- Verification of correctness
- Data conflict and resolution between different data sources
- Judgment calls (such as determining relevance, format or general "moderation")
- "Fuzzy" referential integrity judgment
- Data error corrections
- Data enrichment or enhancement
- Classification of data based on attributes into categories
- De-duplication of data items
- Sentiment analysis
- Data merging
- Image data - correctness, appropriateness, appeal, quality
- Transcription (e.g. hand-written comments, scanned content)
This approach is ideal in areas such as the Data Warehouse, Master Data Management, Customer Data Management, marketing databases, catalogs, sales force automation data and inventory data - or any time that business data needs to be enriched as part of a business process.
Human Crowd Computing is NOT Outsourcing or "Hiring Temps"
Human Crowd Computing is completely different from outsourcing the problem or hiring a large number of temporary workers.
Human Crowd Computing is instantly scalable - up and down. Outsourcing is the equivalent of renting some other company's data center. And "hiring temps" is the equivalent of bringing in a temporary data center. Both approaches take time to "turn on". They can't scale "up" very well. And they're not elastic. And you pay for the resource whether you use its full capacity or not.
CrowdFlower - Scalable, Elastic Human Computing
I'm most familiar with a San Francisco, CA-based firm called CrowdFlower that fits the description of Human Crowd Computing.
It consists of a software platform, including a workflow engine, quality monitoring and "contributor" rating, that manages the distribution of work across a community of 2,000,000 "active contributors" in dozens of countries across the world.
At each step in the workflow, judgments from multiple workers (or "contributors") are algorithmically aggregated into one trusted answer based on each contributor's individual accuracy. Individual contributor accuracy ratings are assessed in a competition-style model. At random points, data are audited by "gold standard" workers to ensure accuracy and quality.
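The aggregation step described above can be sketched in a few lines of Python. This is a minimal illustration, not CrowdFlower's actual algorithm: the function names, the neutral 0.5 weight for contributors with no gold-standard history, and the simple accuracy-weighted vote are all assumptions made for the example.

```python
from collections import defaultdict


def contributor_accuracy(gold_answers, responses):
    """Fraction of gold-standard items a contributor answered correctly."""
    scored = [item for item in responses if item in gold_answers]
    if not scored:
        return 0.5  # assumption: neutral weight when no gold items were answered
    correct = sum(responses[item] == gold_answers[item] for item in scored)
    return correct / len(scored)


def aggregate(judgments, accuracy):
    """Accuracy-weighted vote over (contributor, answer) pairs.

    Returns the winning answer and its share of the total weight.
    """
    weights = defaultdict(float)
    for contributor, answer in judgments:
        weights[answer] += accuracy.get(contributor, 0.5)
    best = max(weights, key=weights.get)
    total = sum(weights.values())
    return best, (weights[best] / total if total else 0.0)


# Hypothetical gold-standard questions and three hypothetical contributors.
gold = {"q1": "A", "q2": "B"}
accuracy = {
    "ann": contributor_accuracy(gold, {"q1": "A", "q2": "B"}),
    "bob": contributor_accuracy(gold, {"q1": "A", "q2": "C"}),
    "cat": contributor_accuracy(gold, {"q1": "B", "q2": "C"}),
}
answer, confidence = aggregate(
    [("ann", "Books"), ("bob", "Toys"), ("cat", "Toys")], accuracy
)
print(answer, round(confidence, 2))
```

In this toy run, one contributor with a perfect gold-standard record outweighs two less reliable contributors who agree with each other, which is the point of weighting by accuracy rather than counting heads.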
In a verification study done with a leading digital media company to verify, correct and enrich business listings, the CrowdFlower platform was able to raise data accuracy from a typical 75% to over 99%.
The CrowdFlower implementation of Human Crowd Computing is highly effective, and proves out the applicability of this model for a wide variety of data verification, enrichment, cleansing and remediation projects.
A Leading Online Marketplace and Human Crowd Computing
A second proof point for this type of technology is an implementation at a leading online marketplace, which has hundreds of millions of listings live at any given moment.
This marketplace has an incredible variety of items listed - in the past, those items have included old gum, entire towns, and even spouses. The fact that anyone can list almost anything makes this marketplace the place to go to find rare or outlandish items.
Major Product Categorization Problems
It's no surprise, then, that one of the biggest challenges this marketplace faces is product categorization. Product categories are a key way that people search for items.
Each month, this marketplace requires up to 100,000 new products to be categorized and matched to a Global Trade Item Number (GTIN) - a unique 12-14 digit product identifier - using product information which typically must be gathered from multiple different sources.
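A Global Trade Item Number carries a check digit, so a basic "verification of correctness" can be automated before any human looks at the record. The sketch below uses the standard GS1 modulo-10 check digit calculation. The function name is hypothetical, and the idea that a check like this would run as a pre-filter ahead of crowd review is an assumption, not something described by the marketplace or by CrowdFlower. The accepted lengths also include the 8-digit GTIN form defined by GS1, in addition to the 12-14 digit forms mentioned above.

```python
def gtin_is_valid(code: str) -> bool:
    """Check length and GS1 modulo-10 check digit of a GTIN string."""
    if not (code.isascii() and code.isdigit()):
        return False
    if len(code) not in (8, 12, 13, 14):
        return False
    digits = [int(ch) for ch in code]
    payload, check = digits[:-1], digits[-1]
    # Working from the right of the payload, weights alternate 3, 1, 3, 1, ...
    total = sum(
        d * (3 if i % 2 == 0 else 1) for i, d in enumerate(reversed(payload))
    )
    return (10 - total % 10) % 10 == check


print(gtin_is_valid("4006381333931"))  # a 13-digit example code
print(gtin_is_valid("4006381333932"))  # same code with a wrong check digit
```

A check digit only catches typing and scanning errors. Deciding whether a valid number actually belongs to the listed product is still the kind of judgment call the crowd is there for.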
Depending on the month, the number of products requiring categorization ranges from below 5,000 to close to 100,000. A scalable and elastic computing model is required to support the variations in workload.
Because judgment calls are involved, and data must be retrieved and compared from potentially many different sources, the CrowdFlower platform uses multiple humans for each judgment call to ensure high levels of accuracy. About 60% of categorizations are completed with 2 or 3 individual responses; however, particularly complex judgment calls can require 10 or more responses. I've confirmed that this algorithm is quite tunable - if your data needs higher levels of certainty, you simply involve more human opinions until you reach the level of confidence you require.
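One way to picture this tunability is a loop that keeps requesting responses until agreement is high enough or a cap is reached. The sketch below is an assumption-laden illustration, not the platform's real logic: the function name, the plain majority agreement measure, and the default threshold and limits are all invented for the example.

```python
from collections import Counter


def collect_until_confident(judgment_stream, threshold=0.8,
                            min_judgments=2, max_judgments=10):
    """Consume answers one at a time until agreement reaches the threshold.

    Returns (top answer, agreement share, number of judgments used).
    """
    votes = Counter()
    for count, answer in enumerate(judgment_stream, start=1):
        votes[answer] += 1
        top_answer, top_votes = votes.most_common(1)[0]
        agreement = top_votes / count
        if count >= min_judgments and agreement >= threshold:
            return top_answer, agreement, count
        if count >= max_judgments:
            break  # cap reached: fall through and report the best answer so far
    if not votes:
        return None, 0.0, 0
    top_answer, top_votes = votes.most_common(1)[0]
    total = sum(votes.values())
    return top_answer, top_votes / total, total


# An easy item: the first two hypothetical contributors agree.
easy = collect_until_confident(["Toys", "Toys", "Toys"])

# A harder item: early disagreement forces more responses.
hard = collect_until_confident(
    ["Toys", "Games", "Toys", "Games", "Toys", "Toys", "Toys", "Toys", "Toys", "Toys"]
)
print(easy, hard)
```

Raising `threshold` or `max_judgments` trades cost for certainty: easy items still finish after two or three responses, while contested ones consume more opinions before an answer is accepted.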
The marketplace formerly outsourced product categorization - essentially paying for a large staff of contractors who were alternately overwhelmed and idle, depending on the day. From week to week, there could be as much as a 400% difference in workload.
With a Human Crowd Computing platform, the marketplace more than tripled its throughput for product categorization - from 300 per hour to 1,000 per hour. At the same time, the number of improper classifications was reduced by over 67%. To cap it off, CrowdFlower claims that this solution reduced the marketplace's costs by some 70%.
CrowdFlower has published a nice 7-page customer success brief on one of their larger customers that is worth reading. I can't link to the report directly, but if you go to their home page and click on the "get a free report" button, you'll get it via e-mail within about 30 seconds - after you answer 3 or 4 pesky questions.
Without question, this model of Human Crowd Computing will become increasingly mainstream in organizations. It's highly appropriate for any situation involving large to huge numbers of small tasks that require human judgment. With the appropriate software platform, the internet and commonly available connectivity/interoperability software, this solution may be exactly what you need for your data problem.
Although I can't personally testify to the reduction in costs that the marketplace experienced, I have little doubt that there was a significant reduction in the "cost per categorization" metric. Furthermore, I am highly confident in the massive scalability of the elastic computing model they employed, and in that model's ability to produce quality results.
Although I've highlighted CrowdFlower as an example, this pair of articles isn't meant to be about CrowdFlower - it's about a new model of leveraging large, distributed opt-in communities of workers who are all connected over the internet and are managed by sophisticated workflows and accuracy ranking systems.
But as an innovator in this space with many examples of successful implementations and over 500 current customers using the platform, CrowdFlower makes for an excellent example of how organizations can solve their data problems using this new approach.