Technology Trends for 2008: Data Management Predictions

Recent hardware trends - including many-core chips, grid topologies built from commercial off-the-shelf components, and custom compute appliances - are leading to new economies and paradigms for highly scalable and reliable computing. Coupled with these trends are growing customer expectations for more powerful, highly scalable architectures that run across these new topologies and deliver greater business volume and velocity. From a customer viewpoint, state-of-the-art data management systems should not merely be agnostic to the underlying hardware topology but should also exploit the unique advantages of the underlying hardware resources.

In the quest for massive data scalability, performance, and reliability, we as a data management vendor anticipate an overall trend this year toward end-to-end system architectures based on data partitioning and dataflow parallelism, creating elastic data management architectures capable of adapting and scaling to existing and future business data requirements.

Based on hardware trajectories and customer appetite for ever-larger data processing workloads, we predict the following trends for 2008:

Trend 1: Creation of TB-scale data management systems based entirely on main-memory storage.

Main-memory data management approaches are becoming mainstream in highly competitive business domains like the finance vertical, where low-latency, extreme transaction processing is essential for competitive advantage. Customers are inevitably looking to scale out their current memory-based systems. There are numerous ways to increase scalability:

(1) Vertically scale-up using state-of-the-art many-core hardware coupled with ample memory
(2) Replicate data to another machine and load balance processing
(3) Partition data, dividing the work across multiple machines
(4) Compress data so more of it can fit into memory storage.

An example of (1) could be replacing a medium-powered server machine with a custom compute appliance like the Azul system for running Java applications. Hardware economics, together with benchmarked application contention for shared resources, will ultimately decide whether this is a viable strategy for scalability and on-demand capacity management. This scalability approach, however, offers no secondary benefit of improved fault tolerance through distribution. Also, beefing up a node's memory and CPU resources does not necessarily increase network bandwidth, which may itself be the scalability bottleneck.

Regarding (2), replication of data servers can improve data processing throughput and latency as well as increase the overall number of supported network clients; however, replication does not scale data volume.
Most noteworthy, simple analytical models warn that with eager (all replicas updated within the same transaction), group (any peer in the group can update data at any time) replication, the deadlock rate rises with the third power of the number of nodes and the fifth power of the transaction size.
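The scaling behavior of that analytical model can be illustrated with a few lines of arithmetic. This is a toy calculation of relative deadlock rates only; the function name and baseline values are invented for illustration:

```python
def relative_deadlock_rate(nodes, txn_size, base_nodes=2, base_txn=5):
    # Under the model, the deadlock rate grows with the cube of the node
    # count and the fifth power of the transaction size.
    return (nodes / base_nodes) ** 3 * (txn_size / base_txn) ** 5

# Doubling the node count alone multiplies the deadlock rate by 8:
print(relative_deadlock_rate(4, 5))   # 8.0
# Doubling the transaction size alone multiplies it by 32:
print(relative_deadlock_rate(2, 10))  # 32.0
```

The transaction-size exponent is why eager group replication degrades so quickly: even modest growth in transaction footprint dominates the node count.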

Scalability through partitioning (3) splits up data, processing, and client load across nodes. This approach offers wide scale-out, as nodes can be added to accommodate new applications and data volume; partitioning, however, can slow down processing unless the underlying system can intelligently co-locate data with application logic. Compression of data (4) increases effective storage capacity so long as the codec overhead does not outweigh the benefit of being able to cache more data in process.
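Partitioned co-location can be sketched in a few lines. All names here are invented for illustration: keys hash-route to an owning partition, and application logic is shipped to that partition rather than pulling data over the network:

```python
# Hash-partition keys across in-memory "nodes" and ship the function to
# the data, so processing runs where the partition lives (illustrative).
NODES = [dict() for _ in range(4)]

def node_for(key):
    # Deterministic routing: the same key always maps to the same node.
    return NODES[hash(key) % len(NODES)]

def put(key, value):
    node_for(key)[key] = value

def execute_on(key, fn):
    # Co-locate logic with data: run fn against the owning partition.
    return fn(node_for(key), key)

put("order:42", {"qty": 3, "price": 10.0})
total = execute_on("order:42",
                   lambda part, k: part[k]["qty"] * part[k]["price"])
print(total)  # 30.0
```

Adding a node here changes only the routing function, which is the sense in which partitioning offers wide scale-out.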

Each of these scalability approaches has merits and caveats. We predict the emergence of system architectures that combine all four methods, offering the best overall flexibility and scalability and yielding multi-TB systems that work entirely in memory. This synergistic architecture will likely begin with data partitioning as the fundamental underlying strategy for both ad-hoc and planned extensibility. To load balance processing and ensure fault tolerance, data partitions are replicated to at least one backup. Compressing data objects on partitions maximizes in-memory storage density per node, especially when partitions group related data into columnar (vertical) structures where all data within a column is of a single type. The distributed system is then deployed on many-core big iron where network latency and I/O bandwidth are improved by a high-performance backplane. At the macro level, multi-site deployments share data with each other through a WAN gateway, partitioning site-relevant data across geographies.
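The columnar compression piece of this architecture can be sketched in a few lines. The exact ratios depend on the data, but a uniform, single-typed column generally compresses far better than heterogeneous rows:

```python
import pickle
import zlib

# Columnar layout groups values of a single type; such uniform data
# compresses well, raising in-memory storage density. Figures are
# illustrative, not a benchmark.
qty_column = [i % 100 for i in range(10_000)]  # one typed column

raw = pickle.dumps(qty_column)
packed = zlib.compress(raw)

print(len(packed) * 5 < len(raw))  # True: the repetitive column shrinks over 5x
assert pickle.loads(zlib.decompress(packed)) == qty_column  # lossless round trip
```

The trade-off noted above still applies: the decompression cost on every read is the codec overhead that must not outweigh the extra cached data.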

This combined scalability architecture can be run on any hardware topology. For example, if price-performance analysis does not justify the purchase of expensive big iron, the same architecture can be overlaid onto a grid of COTS machines.

Trend 2: Re-emergence of parallel databases and a movement to shared-nothing topologies.

An inevitable outcome of highly scalable architectures is that data will be partitioned onto nodes connected as a distributed system. On the upside, networking many nodes together allows a distributed system to scale incrementally as it provides applications with additional processors and storage locations. Distributed systems, on the other hand, also pose fundamental challenges: how to co-locate related data together with application logic to ensure fast processing; how to ensure efficient transactional updates across nodes; and how to manage application consistency requirements and high availability (especially in an asynchronous environment where nodes can fail and messages can be delivered and applied in non-deterministic time).

When designing scalable architectures, today's system designers have the hindsight of over 30 years of parallel database design experience. The parallel database community has grappled with many of these questions, resulting in collective wisdom about three historical approaches: shared-memory, shared-disk, and shared-nothing architectures. Both shared-memory and shared-disk architectures encounter fundamental scalability limitations because all I/O and memory requests must be transferred over a bus shared by all processors or between CPUs and disk networks. Scalability can also be limited by distributed locking between processor nodes, since coordination requires complex, message-based distribution protocols. Shared-nothing is generally regarded as the best scaling architecture: data is partitioned across storage at each processor node, allowing multiple processors to process large data sets in parallel. In addition, nodes maintain complete ownership over their data, so complex, distributed locking protocols are not required.
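The ownership property that makes shared-nothing scale can be shown in miniature. In this sketch (class and key names invented), each node owns its partition outright, so every update needs only a purely local lock and no cross-node locking protocol:

```python
import threading

# Shared-nothing sketch: each node exclusively owns one partition and
# updates it under a local lock only; no distributed coordination needed.
class Node:
    def __init__(self, partition):
        self.partition = partition      # data owned by this node alone
        self.lock = threading.Lock()    # a local lock, never distributed

    def update(self, key, value):
        with self.lock:                 # no message-based lock protocol
            self.partition[key] = value

nodes = [Node({}) for _ in range(2)]

def owner(key):
    # Every key has exactly one owning node.
    return nodes[hash(key) % len(nodes)]

owner("acct:7").update("acct:7", 500)
print(owner("acct:7").partition["acct:7"])  # 500
```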

Trend 3: Emergence of dataflow parallelism leading to an SOA development pattern based on independent entities cooperating via workflow.

A dataflow approach to distributed data management enables both pipelined and partitioned parallelism. Operators can be composed into parallel dataflow graphs. By streaming the output of one operator into the input of another, multiple operators can work in unison, creating pipelined parallelism. By partitioning data into memory on multiple nodes, a task can be split into many independent operators, each working on a part of the data; this is known as partitioned parallelism. Dataflow is a perfect match for shared-nothing architectures since partitions can be deployed on individual nodes where local data can be independently owned, modified, and queried. At the highest level, this isolation of data into partitions leads to an emerging service-oriented architecture pattern where partitions behave as abstractions called service entities. A service entity is the single owner of a collection of data: a holistic chunk of the partitioned business data domain. A service entity is the only actor that can modify its collection of data; that is, data within one service entity evolves independently of data within another, the scope of a transaction is limited to one service entity, and transactions cannot span service entities.
Service entities communicate with each other via messaging queues. Hence the service entity pattern enables a workflow model where input to one service entity can be pipelined to downstream service entities.
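Both forms of dataflow parallelism can be sketched with plain generators. This is a toy dataflow, not a product API: each partition runs its own scan-then-filter pipeline independently (partitioned parallelism), and within each pipeline records stream from one operator to the next (pipelined parallelism):

```python
# Toy dataflow: generator operators composed into per-partition pipelines.

def scan(partition):
    # Source operator: stream records out of one partition.
    for record in partition:
        yield record

def select(records, predicate):
    # Filter operator: consumes the scan's stream record by record.
    for r in records:
        if predicate(r):
            yield r

partitions = [[1, 2, 3, 4], [5, 6, 7, 8]]  # data split across two "nodes"

# Each partition gets its own independent scan -> select pipeline.
pipelines = [select(scan(p), lambda x: x % 2 == 0) for p in partitions]
merged = [x for pipe in pipelines for x in pipe]
print(merged)  # [2, 4, 6, 8]
```

Because the pipelines never touch each other's partitions, they could run on separate nodes with no coordination, which is exactly the shared-nothing fit described above.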

We anticipate a dataflow approach to scalability will result in a development pattern where high-level service entities work autonomously, are transactionally independent, and provide system-wide parallelism implemented via workflow queues. The following developer guidelines describe effective use of dataflow and the service entity pattern for scalability:

1. Given that service entities are isolated data domains, normalization of data models is vital to allow independent evolution of data within service entities. This means data modelers will create globally unique identifiers so data items can be modified independently of each other and resolved by id when needed.

2. Partitioning defines logical boundaries for service entities so agreement about what data items should be grouped together within the same service entity will be established and well-defined.

3. Globally serializable transactions do not scale across nodes in a distributed system: distributed locking slows down applications by introducing latency. Instead of distributed transactions, we expect system architects to use workflow between service entities.

4. Dataflow enables system-wide parallelism, so applications will be designed to be as embarrassingly parallel as possible.

5. Repeated query patterns can be transformed from request-response polling into stateful publish-subscribe views driven by dataflow. For example, event-driven applications that perform stream processing convert naturally to a dataflow view.
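The guidelines above can be illustrated with a toy pair of service entities; entity names and message shapes are invented. Each entity is the sole owner of its data and cooperates with the other only through a workflow queue, so no transaction ever spans both:

```python
import queue
import threading

# Two service entities linked only by a workflow queue: no shared state,
# no distributed transaction. Names are invented for illustration.
orders_to_billing = queue.Queue()

def order_entity():
    orders = {"o1": {"amount": 99.0}}        # data owned only by this entity
    orders_to_billing.put(("o1", orders["o1"]["amount"]))  # downstream work
    orders_to_billing.put(None)              # end-of-stream marker

def billing_entity(invoices):
    while True:
        msg = orders_to_billing.get()
        if msg is None:
            break
        order_id, amount = msg
        # Each message is handled as a purely local, single-entity step.
        invoices[order_id] = round(amount * 1.1, 2)

invoices = {}
producer = threading.Thread(target=order_entity)
consumer = threading.Thread(target=billing_entity, args=(invoices,))
producer.start(); consumer.start()
producer.join(); consumer.join()
print(invoices)  # {'o1': 108.9}
```

The queue plays the role of the workflow in guideline 3: the order entity never waits on a lock held by billing, and each side's update commits locally.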

Given these high level system architecture trends for 2008, we see the following concrete impacts on the market for extreme transaction processing services and products:

*Move away from traditional disk-based OLTP databases in favor of scalable memory-based systems.

*Merger of OLTP and OLAP solutions. Real-time OLAP systems will emerge that provide cube views that are updated as underlying OLTP tables change.

*Rise of cloud computing offerings. In addition to ASP offerings like Amazon S3 and EC2, as well as IBM's recent cloud computing announcement, we anticipate traditional database vendors will jump on the bandwagon, realize that databases can become software-as-a-service, and provide their own cloud computing offerings. Smaller data vendors might get into the cloud computing game too, creating commoditized Google-like infrastructure based on the Yahoo-backed open source Hadoop project.
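The OLTP/OLAP merger above can be sketched as an aggregate view folded incrementally into shape as OLTP writes arrive, instead of re-scanning a fact table per query. Dimension and measure names are invented for illustration:

```python
from collections import defaultdict

# A toy real-time "cube": an aggregate OLAP view maintained incrementally
# on the OLTP write path.
cube = defaultdict(float)  # (region, product) -> total sales

def oltp_insert(region, product, amount):
    # Each transactional row is folded into the OLAP view as it commits.
    cube[(region, product)] += amount

oltp_insert("US", "widget", 100.0)
oltp_insert("US", "widget", 50.0)
oltp_insert("EU", "widget", 70.0)
print(cube[("US", "widget")])  # 150.0
```

Queries against the cube see current data without touching the transactional rows, which is the sense in which the OLAP view updates "as underlying OLTP tables change."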

More by Russell Okamoto

About GemStone Systems

GemStone Systems is an enterprise software company that has pioneered the adoption of several enterprise infrastructure technologies through its rich history. GemStone's products are used by over 200 large enterprise customers in mission-critical, enterprise-wide deployments in industries such as financial services, the Federal Government, transportation, telecommunications, and energy. GemStone Systems has built deep expertise in enterprise-level technologies such as distributed resource management, in-memory caching and disk persistence, scalable data distribution, high performance computing operations, and object management - areas at the core of building a highly reliable, mission-critical data infrastructure. GemStone brings together a unique set of skills and talented people who have pioneered these technologies, along with others who have successfully driven the creation of markets and products at other successful enterprise software companies. GemStone has also built enterprise-class support models that allow global enterprises to rely on GemStone as a solid business partner, with experience dating back many years.