When it comes to data, the workhorse relational database has been the tool of choice for businesses for well over 20 years now. Challengers have come and gone but the trusty RDBMS is the foundation of almost all enterprise systems today. This includes almost all transactional and data warehousing systems. The RDBMS has earned its place as a proven model that, despite some quirks, is fundamental to the very integrity and operational success of IT systems around the world.
However, as I discussed earlier this year in my 10 Must-Know Topics for Software Architects in 2009 post, the relational database is finally showing signs of age as data volumes and network speeds grow faster than hardware improvements under Moore's Law can keep pace with. The Web in particular is driving innovation in new ways of processing information as the data footprints of Internet-scale applications become prohibitive for traditional SQL database engines.
When it comes to database processing today, change is being driven by (at least) four factors:
- Speed. The seek times of physical storage are not keeping pace with improvements in network speeds.
- Scale. Scaling the RDBMS out efficiently is difficult; clustering beyond a handful of servers is notoriously hard.
- Integration. Today's data processing tasks increasingly have to access and combine data from many different non-relational sources, often over a network.
- Volume. Data volumes have grown from tens of gigabytes in the 1990s to hundreds of terabytes and often petabytes in recent years.
One of the most discussed new alternatives at the moment is Hadoop, a popular open source implementation of MapReduce. MapReduce is a simple yet very powerful method for processing and analyzing extremely large data sets, even up to the multi-petabyte level. At its most basic, MapReduce applies a supplied function to each input record to produce intermediate key/value pairs (the "map" step), then groups those pairs by key and distills each group with a second supplied function to extract the desired results (the "reduce" step). It was originally invented by engineers at Google to handle the building of production search indexes. The MapReduce technique has since spilled over into other disciplines that process vast quantities of information, including science, industry, and systems management. For its part, Hadoop has become the leading implementation of MapReduce.
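To make the two steps concrete, here is a minimal sketch of the MapReduce idea in plain Python: a local simulation of the classic word-count example, not Hadoop itself, with function names of my own choosing.

```python
from collections import defaultdict

def map_fn(line):
    # Map step: emit a (key, value) pair for every word seen.
    for word in line.lower().split():
        yield (word, 1)

def reduce_fn(word, counts):
    # Reduce step: distill all the values for one key into a result.
    return (word, sum(counts))

def mapreduce(lines):
    # Shuffle: group mapped values by key, as the framework does
    # between the map and reduce phases.
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_fn(line):
            groups[key].append(value)
    return dict(reduce_fn(k, v) for k, v in groups.items())

print(mapreduce(["the quick brown fox", "the lazy dog"]))
# {'the': 2, 'quick': 1, 'brown': 1, 'fox': 1, 'lazy': 1, 'dog': 1}
```

In a real cluster, the map calls run in parallel across many machines, the framework handles the shuffle over the network, and the reduce calls run in parallel as well; the programmer supplies only the two functions.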
While there are many non-relational database approaches out there today (see my emerging IT and business topics post for a list), nothing currently matches Hadoop for the amount of attention it's receiving or the concrete results being reported in recent case studies. A quick look at the list of organizations with applications powered by Hadoop includes Yahoo!, with over 25,000 nodes (including a single, massive 4,000-node cluster); Quantcast, which says it has over 3,000 cores running Hadoop and currently processes over 1PB of data per day; and Adknowledge, which uses Hadoop to process over 500 million clickstream events daily using up to 200 nodes.
These datasets would previously have been very challenging and expensive to take on with a traditional RDBMS using standard bulk load and ETL approaches, never mind trying to efficiently combine multiple data sources simultaneously or dealing with volumes of data that simply can't reside on any single machine (or often even dozens). Hadoop deals with this by using a distributed file system (HDFS) that's designed to deal coherently with datasets that can only reside across distributed server farms. HDFS is also fault resilient, so it doesn't impose the requirement of RAID drives on individual nodes in a Hadoop compute cluster, allowing the use of truly low-cost commodity hardware.
So what does this specifically mean to enterprise users who would like to improve their data processing capabilities? Well, first there are some catches to be aware of. Despite enormous strengths in distributed data processing and analysis, MapReduce is not good in some key areas where the RDBMS is extremely strong (and vice versa). The MapReduce approach tends to have high latency compared to relational databases (i.e. it's not suitable for real-time transactions) and is strongest at processing large volumes of write-once data where most of the dataset needs to be processed at one time. The RDBMS excels at point queries and updates, while MapReduce is best when data is written once and read many times.
The story is similar with structured data, where the RDBMS and the rules of database normalization define precise laws for preserving the integrity of structured data, laws that have stood the test of time. MapReduce is designed for a less structured, more federated world where schemas may be used but data formats can be much looser and more freeform.
RDBMS and Hadoop: Apples and Oranges?
Here is a comparison of the overall differences between the RDBMS and MapReduce-based systems such as Hadoop:
| | RDBMS | MapReduce |
| --- | --- | --- |
| Access | Interactive and batch | Batch |
| Structure | Fixed schema | Unstructured schema |
| Language | SQL | Procedural (Java, C++, Ruby, etc.) |
| Updates | Read and write | Write once, read many times |

Figure 1: Comparing RDBMS to MapReduce
From this it's clear that the MapReduce model cannot replace the traditional enterprise RDBMS. However, it can be a key enabler of a number of interesting scenarios that can considerably increase flexibility, turn-around times, and the ability to tackle problems that weren't possible before.
On that last point, the key is that SQL-based processing tends to stop scaling linearly after a certain ceiling, usually just a handful of nodes in a cluster. With MapReduce, you can consistently get performance gains by increasing the size of the cluster. In other words, double the size of a Hadoop cluster and a job will run roughly twice as fast, triple it and the same holds, and so on.
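As a back-of-envelope illustration of that linear-scaling claim (the per-node throughput figure below is a hypothetical assumption for illustration, not a benchmark):

```python
def job_hours(data_tb, nodes, tb_per_node_per_hour=0.25):
    # Ideal linear scaling: total throughput grows with the node count.
    # The per-node throughput default is a hypothetical figure.
    return data_tb / (nodes * tb_per_node_per_hour)

print(job_hours(100, 50))   # 8.0 hours for 100 TB on 50 nodes
print(job_hours(100, 100))  # 4.0 -- doubling the cluster halves the runtime
```

Real jobs pay some overhead for scheduling and the shuffle phase, so the scaling is near-linear rather than perfectly linear, but the model holds well in practice across a wide range of cluster sizes.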
Ten Ways To Improve the RDBMS with Hadoop
So Hadoop can complement the enterprise RDBMS in a number of powerful ways. These include:
- Accelerating nightly batch business processes. Many organizations have production transaction systems that require nightly processing and have narrow windows to perform their calculations and analysis before the start of the next day. Since Hadoop can scale linearly, it can enable internal or external on-demand cloud farms to dynamically handle shrinking performance windows and take on larger volumes than an RDBMS can easily deal with. This doesn't eliminate the import/export challenges, which depend on the application, but it can certainly compress the windows between them.
- Storage of extremely high volumes of enterprise data. The Hadoop Distributed File System is a marvel in itself and can be used to hold extremely large data sets safely on commodity hardware for the long term, data that otherwise couldn't be stored or handled easily in a relational database. I am specifically talking about volumes of data that today's RDBMSs would still have trouble with, such as dozens or hundreds of petabytes, which are common in genetics, physics, aerospace, counterintelligence, and other scientific, medical, and government applications.
- Creation of automatic, redundant backups. Hadoop can retain the data that it processes even after it's been imported into other enterprise systems. HDFS creates a natural, reliable, and easy-to-use backup environment for almost any amount of data at reasonable prices, considering that it's essentially a high-speed online data storage environment.
- Improving the scalability of applications. Very low-cost commodity hardware can be used to power Hadoop clusters since redundancy and fault tolerance are built into the software rather than depending on expensive, proprietary enterprise hardware or software. This makes adding more capacity (and therefore scale) easier to achieve, and Hadoop is an affordable and very granular way to scale out instead of up. While there can be a cost in converting existing applications to Hadoop, for new applications it should be a standard option in the software selection decision tree. Note: Hadoop's fault tolerance is acceptable, not best-of-breed, so check this against your application's requirements.
- Use of Java for data processing instead of SQL. Hadoop is a Java platform and can be used by just about anyone fluent in the language (other language options are becoming available via APIs). While this won't help shops that have plenty of database developers, Hadoop can be a boon to organizations that have strong Java environments with good architecture, development, and testing skills. And while, yes, it's possible to use languages such as Java and C++ to write stored procedures for an RDBMS, it's not a widespread activity.
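One such option is Hadoop's Streaming API, which lets any executable that reads lines on stdin and writes lines on stdout act as the mapper or reducer. A word-count pair might be sketched like this (a simplified illustration; the streaming framework sorts the map output by key before the reduce phase, which is what the reducer relies on):

```python
import sys
from itertools import groupby

def mapper(lines):
    # Emit one tab-separated "word<TAB>1" record per word.
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(lines):
    # Streaming delivers map output sorted by key, so records for the
    # same word arrive contiguously and can be summed group by group.
    parsed = (line.rstrip("\n").split("\t") for line in lines)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(v) for _, v in group)}"

if __name__ == "__main__" and len(sys.argv) > 1:
    # Run as "wordcount.py map" or "wordcount.py reduce", e.g. passed to
    # the streaming jar via its -mapper and -reducer options.
    stage = mapper if sys.argv[1] == "map" else reducer
    for record in stage(sys.stdin):
        print(record)
```

The same division of labor applies in any language: the framework handles distribution, sorting, and fault recovery, while the script only transforms a stream of records.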
- Producing just-in-time feeds for dashboards and business intelligence. Hadoop excels at looking at enormous amounts of data and producing detailed analyses that an RDBMS would often take too long, or be too expensive, to carry out. Facebook, for example, uses Hadoop for daily and hourly summaries of its 150 million+ monthly visitors. The resulting information can be quickly transferred to BI, dashboards, or mashup platforms.
- Handling urgent, ad hoc requests for data. While expensive enterprise data warehousing software can certainly do this, Hadoop is a strong performer when it comes to quickly asking and getting answers to urgent questions involving extremely large datasets.
- Turning unstructured data into relational data. While ETL tools and bulk load applications work well with smaller datasets, few can approach the data volume and performance that Hadoop can, especially at a similar price/performance point. The ability to take mountains of inbound or existing business data, spread the work over a large distributed cloud, add structure, and import the result into an RDBMS makes Hadoop one of the most powerful database import tools around.
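As a toy illustration of that structure-adding map step, the sketch below parses a hypothetical clickstream line format into CSV rows ready for an RDBMS bulk load (the log format and field names are assumptions for illustration only):

```python
import csv
import io
import re

# Hypothetical clickstream line format, for illustration:
#   2009-08-01T12:00:05 203.0.113.9 GET /products/42
LOG_PATTERN = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2}T\S+)\s+(?P<ip>\S+)\s+"
    r"(?P<method>\S+)\s+(?P<path>\S+)")

def to_csv_rows(lines):
    # Map step: turn each raw line into a structured row, silently
    # dropping anything that doesn't parse; the CSV output is ready
    # for a bulk-load tool to import into a relational table.
    out = io.StringIO()
    writer = csv.writer(out)
    for line in lines:
        m = LOG_PATTERN.match(line)
        if m:
            writer.writerow([m["ts"], m["ip"], m["method"], m["path"]])
    return out.getvalue()

print(to_csv_rows(["2009-08-01T12:00:05 203.0.113.9 GET /products/42",
                   "malformed line that will be skipped"]))
```

On a cluster, each mapper would apply this parsing to its own slice of the raw files in parallel, so the conversion rate grows with the number of nodes rather than being bottlenecked on a single ETL server.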
- Taking on tasks that require massive parallelism. Hadoop has been known to scale out to thousands of nodes in production environments. Even better, it requires relatively little specialized programming skill to achieve this, since parallelism is an intrinsic property of the platform. While you can do the same with SQL, it requires skill and experience with the techniques; in other words, you have to know what you're doing. Organizations that are hitting ceilings with their current RDBMS can look at Hadoop to help break through them.
- Moving existing algorithms, code, frameworks, and components to a highly distributed computing environment. Done right -- and there are challenges depending on what your legacy code wants to do -- Hadoop can be used as a way to migrate old, single-core code into a highly distributed environment that provides efficient, parallel access to ultra-large datasets. Many organizations already have proven code that is tested, hardened, and ready to use but is limited without an enabling framework. Hadoop adds the mature distributed computing layer that can transition these assets to a much larger and more powerful modern environment.
Hadoop is growing in leaps and bounds these days, and there are additions or extensions to meet a wide variety of needs. Just a few examples include Pig, a "platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs" that enables "embarrassingly parallel" data analysis tasks. There is HBase, which enables random, read/write access to large datasets, something difficult to do with Hadoop itself. Also worth mentioning is HadoopDB, an attempt to more explicitly combine the methods of MapReduce and the relational database.
In any case, while Hadoop is still an emerging technology, what currently makes it special is its proven ability to perform well across numerous problem domains in a large number of well-known production applications. This makes it likely that you'll see this approach considered more and more often in next-generation enterprises. I'll keep track of it here as I can; in the meantime, I recommend spending some time understanding how Hadoop works and getting a feel for what it can do in your enterprise when you need it.
Are you planning to use Hadoop in your organization? Why or why not?