Recent hardware trends - including many-core chips, grid topologies built from
commercial off-the-shelf components, and custom compute appliances - are leading
to new economies and paradigms for highly scalable and reliable computing. Coupled
with these trends are growing customer expectations for more powerful, highly
scalable architectures that run across these new topologies and deliver more business
volume and velocity. From a customer viewpoint, state-of-the-art data management
systems should not merely be agnostic to running on these different hardware topologies
but should also exploit unique advantages of underlying hardware resources.
In the quest for massive data scalability, performance, and reliability, as
a data management vendor, we anticipate an overall trend this year toward end-to-end
system architectures based on data partitioning and dataflow parallelism, creating
elastic data management architectures capable of adapting and scaling to
existing and future business data requirements.
Based on hardware trajectories and customer appetite for larger and larger
data processing, we predict the following trends for 2008:
Trend 1: Creation of TB-scale data management systems that are based
entirely on main-memory storage.
Main-memory data management approaches are becoming mainstream in highly competitive
business domains like the finance vertical, where low-latency, extreme transaction
processing is essential for competitive advantage. Customers are inevitably
looking to scale out their current memory-based systems. There are several
ways to increase scalability:
(1) Vertically scale up using state-of-the-art many-core hardware coupled with large memory capacity
(2) Replicate data to another machine and load balance processing
(3) Compress data so more of it can fit into memory storage
(4) Partition data, dividing the work across two or more machines.
An example of (1) could be replacing a medium-powered server machine with a
custom compute appliance like the Azul system for running Java applications.
Hardware economics and application contention for shared resources measured
through performance benchmarks will ultimately decide if this is a viable strategy
for scalability and on-demand capacity management. This scalability approach,
however, offers no secondary benefit of improved fault tolerance through distribution.
Also, beefing up a node's memory and CPU resources does not necessarily increase
network bandwidth, which may be the real scalability bottleneck.
Regarding (2), replication of data servers can improve data processing throughput
and latency as well as increase the overall number of network clients; however,
replication does not scale data volume.
Most noteworthy, simple analytical models warn that with eager (all replicas
updated within the same transaction), group (any peer in the group can update
data at any time) replication, deadlocks rise with the third power of the number
of nodes and with the fifth power of the transaction size.
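Taken at face value, that relationship can be written as a rough proportionality (an illustration of the claim above, not a derivation with exact constants):

    deadlock rate ∝ (number of nodes)^3 × (transaction size)^5

so doubling the number of replicating nodes multiplies the deadlock rate by roughly 8, and doubling the transaction size multiplies it by roughly 32.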
Compression of data (3) increases data storage capacity so long as the codec
overhead does not outweigh the benefit of being able to cache more data in process.
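As a rough illustration of this trade-off, the following sketch (purely illustrative code, using only the JDK's Deflater) compresses a column of values of a single type and compares raw versus compressed sizes; whether the CPU cost of the codec is worth the extra effective capacity is exactly the benchmarking question raised above.

    import java.nio.ByteBuffer;
    import java.util.zip.Deflater;

    // Illustrative sketch: one column of a partition, stored as a single-typed
    // array so a general-purpose codec (or a lighter encoding) compresses well.
    class LongColumn {
        static byte[] compress(long[] values) {
            ByteBuffer raw = ByteBuffer.allocate(values.length * Long.BYTES);
            for (long v : values) raw.putLong(v);

            Deflater deflater = new Deflater(Deflater.BEST_SPEED);
            deflater.setInput(raw.array());
            deflater.finish();

            // Output buffer sized generously for this illustrative, regular data.
            byte[] out = new byte[raw.capacity() + 64];
            int compressedSize = deflater.deflate(out);
            deflater.end();

            byte[] result = new byte[compressedSize];
            System.arraycopy(out, 0, result, 0, compressedSize);
            return result;
        }

        public static void main(String[] args) {
            long[] tradeTimestamps = new long[100_000];
            for (int i = 0; i < tradeTimestamps.length; i++) tradeTimestamps[i] = 1_200_000_000L + i;

            byte[] compressed = compress(tradeTimestamps);
            System.out.printf("raw: %d bytes, compressed: %d bytes%n",
                    tradeTimestamps.length * Long.BYTES, compressed.length);
        }
    }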
Scalability through partitioning (4) splits up data, processing, and client
load across nodes. This approach offers wide scale-out as nodes can be added
to accommodate new applications and data volume. Partitioning, however, can
slow down processing unless the underlying system can intelligently co-locate
data with application logic.
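To make the co-location point concrete, here is a minimal sketch (hypothetical class and method names, not any particular product's API) that routes a function to the node owning a key's partition, so the logic executes next to the data instead of pulling the data across the network:

    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.function.Function;

    // Minimal sketch of key-based partitioning with co-located execution.
    class Node<K, V> {
        private final Map<K, V> localData = new HashMap<>();  // data owned solely by this node
        void store(K key, V value) { localData.put(key, value); }
        V lookup(K key) { return localData.get(key); }
    }

    class PartitionedStore<K, V> {
        private final List<Node<K, V>> nodes;                  // one owning node per partition

        PartitionedStore(List<Node<K, V>> nodes) { this.nodes = nodes; }

        // Hash the key to find the node that owns its partition.
        private Node<K, V> ownerOf(K key) {
            return nodes.get(Math.floorMod(key.hashCode(), nodes.size()));
        }

        void put(K key, V value) { ownerOf(key).store(key, value); }

        // Ship the function to the data: the logic runs where the entry lives,
        // so the value never has to travel across the network to the caller.
        <R> R executeOnOwner(K key, Function<V, R> logic) {
            return logic.apply(ownerOf(key).lookup(key));
        }

        public static void main(String[] args) {
            PartitionedStore<String, Integer> store =
                new PartitionedStore<>(Arrays.asList(new Node<>(), new Node<>(), new Node<>()));
            store.put("order-42", 1500);
            Integer doubled = store.executeOnOwner("order-42", balance -> balance * 2);
            System.out.println("computed next to the data: " + doubled);
        }
    }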
Each of these scalability approaches has merits and caveats. We predict the
emergence of system architectures that combine all of these methods into an
approach offering the best overall flexibility and scalability: multi-TB
systems that work entirely in memory. This synergistic architecture
will likely begin with data partitioning as the fundamental underlying strategy
for both ad-hoc and planned extensibility. To load balance processing and ensure
fault tolerance, data partitions are replicated to at least one backup. Compression
of data objects on partitions maximizes data memory storage density per node
especially when partitions group related data into columnar (vertical) data
structures where all data within a column is a single type. The distributed
system is then deployed on many-core big iron where network latency and I/O
bandwidth are improved by a high-performance backplane. At the macro level, multi-site
system deployments share data with each other through a WAN gateway, partitioning
site-specific data across geographies.
This combined scalability architecture can be run on any hardware topology.
For example, if price-performance analysis does not justify the purchase of
expensive big iron, the same architecture can be overlaid onto a grid of COTS machines.
Trend 2: Re-emergence of parallel databases and a movement toward shared-nothing architectures.
An inevitable outcome of highly scalable architectures is that data will be
partitioned onto nodes connected as a distributed system. On the upside, networking
many nodes together allows a distributed system to scale incrementally as it
provides applications with additional processors and storage locations. Distributed
systems, on the other hand, also pose fundamental challenges, viz., how to co-locate
related data together with application logic to ensure fast processing; how
to ensure efficient transactional updates across nodes; how to manage application
consistency requirements and high availability (especially in an asynchronous
environment where nodes can fail and messages can be delivered and applied in
non-deterministic time). When designing scalable architectures, today's system
designers have the hindsight of over 30 years of parallel database design experience.
The parallel database community has grappled with many of these questions resulting
in collective wisdom about three historical approaches: shared-memory,
shared-disk, and shared-nothing architectures. Both shared-memory and
shared-disk architectures encounter fundamental scalability limitations because
all I/O and memory requests must be transferred over a bus shared by all processors
or between CPUs and disk networks. Scalability can also be limited by distributed
locking between processor nodes since coordination requires complex, message-based
distribution protocols. Shared-nothing is generally regarded as the best scaling
architecture since data is partitioned across storage at each processor node
allowing multiple processors to process large data sets in parallel. In addition,
nodes maintain complete ownership over their data so complex, distributed locking
protocols are not required.
Trend 3: Emergence of dataflow parallelism leading to a SOA development
pattern based on independent entities cooperating via workflow.
A dataflow approach to distributed data management enables both pipelined and
partitioned parallelism. Operators can be composed into parallel dataflow graphs.
By streaming the output of one operator into the input of another, multiple
operators can work in unison creating pipelined parallelism. By partitioning
data into memory on multiple nodes, a task can be split into many independent
operators each working on a part of the data. This is known as partitioned parallelism.
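A minimal sketch of both forms of parallelism, using only JDK threads and queues (all names illustrative): each operator runs in its own thread and streams results downstream through a queue, and one such pipeline is instantiated per data partition:

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    // Illustrative dataflow sketch: a scan operator streams values to an
    // aggregate operator through a queue (pipelined parallelism), and one such
    // pipeline runs per data partition (partitioned parallelism).
    public class DataflowSketch {
        static final int PARTITIONS = 4;
        static final Integer END = Integer.MIN_VALUE;   // end-of-stream marker

        public static void main(String[] args) throws InterruptedException {
            Thread[] pipelines = new Thread[PARTITIONS];
            for (int p = 0; p < PARTITIONS; p++) {
                final int partition = p;
                pipelines[p] = new Thread(() -> runPipeline(partition));
                pipelines[p].start();
            }
            for (Thread t : pipelines) t.join();
        }

        // One pipeline: scan operator -> queue -> aggregate operator.
        static void runPipeline(int partition) {
            BlockingQueue<Integer> queue = new ArrayBlockingQueue<>(1024);

            Thread scan = new Thread(() -> {
                try {
                    for (int i = partition; i < 1_000; i += PARTITIONS) queue.put(i); // this partition's slice
                    queue.put(END);
                } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
            });

            Thread aggregate = new Thread(() -> {
                try {
                    long sum = 0;
                    for (Integer v = queue.take(); !v.equals(END); v = queue.take()) sum += v;
                    System.out.println("partition " + partition + " sum = " + sum);
                } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
            });

            scan.start();
            aggregate.start();
            try {
                scan.join();
                aggregate.join();
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        }
    }

In a real deployment the queues would span processes or machines rather than threads, but the shape of the parallelism is the same.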
Dataflow is a perfect match for shared-nothing architectures since partitions
can be deployed on individual nodes where local data can be independently owned,
modified, and queried. At the highest level, this isolation of data into partitions
leads to an emerging service-oriented architecture pattern where partitions
behave as abstractions called service entities. A service entity is a single
owner of a collection of data, a holistic chunk of the partitioned business data
domain. A service entity is the only actor that can modify its collection of
data, that is, data within a service entity evolves independently of data within
another service entity; the scope of a transaction is limited to one service
entity and transactions cannot span service entities.
Service entities communicate with each other via messaging queues. Hence the
service entity pattern enables a workflow model where input to one service entity
can be pipelined to downstream service entities.
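The following sketch (entirely hypothetical names, not a product API) captures the pattern described above: each service entity is the sole owner of its data, applies changes only in response to messages on its inbound queue, and hands follow-on work to a downstream entity's queue rather than opening a transaction that spans entities:

    import java.util.HashMap;
    import java.util.Map;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    // Illustrative service entity: sole owner of its data; modified only via
    // messages on its inbox; hands downstream work to another entity's inbox.
    class ServiceEntity {
        private final String name;
        private final Map<String, String> ownedData = new HashMap<>(); // only this entity mutates it
        final BlockingQueue<String> inbox = new LinkedBlockingQueue<>();
        private final ServiceEntity downstream;                        // next workflow step, may be null

        ServiceEntity(String name, ServiceEntity downstream) {
            this.name = name;
            this.downstream = downstream;
        }

        // Process one message: a local, single-entity "transaction", then enqueue follow-on work.
        void processOne() throws InterruptedException {
            String message = inbox.take();
            ownedData.put(message, "processed by " + name);  // local update, no cross-entity lock
            if (downstream != null) downstream.inbox.put(message);
        }
    }

    class WorkflowDemo {
        public static void main(String[] args) throws InterruptedException {
            ServiceEntity shipping = new ServiceEntity("shipping", null);
            ServiceEntity orders = new ServiceEntity("orders", shipping);

            orders.inbox.put("order-42");
            orders.processOne();    // orders updates its own data, then forwards to shipping
            shipping.processOne();  // shipping updates its own data independently
            System.out.println("workflow complete for order-42");
        }
    }

Because each entity's update is local, no distributed lock or two-phase commit is needed; consistency between entities is mediated by the queues.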
We anticipate a dataflow approach to scalability will result in a development
pattern where high-level service entities work autonomously, are transactionally
independent, and provide system-wide parallelism implemented via workflow queues.
Effective use of dataflow and the service entity pattern is guided by the
following developer guidelines for scalability:
1. Given that service entities are isolated data domains, normalization of
data models is vital to allow independent evolution of data within service entities.
This means data modelers will create globally unique identifiers so data items
can be modified independently from each other and be resolved when needed by
the consuming application.
2. Partitioning defines logical boundaries for service entities, so agreement
about which data items should be grouped together within the same service entity
must be established and well defined.
3. It is a fact of nature that globally serializable transactions are not scalable
across nodes in a distributed system. Distributed locking slows down applications
by introducing latency. Instead of distributed transactions, we expect system
architects to use workflow between service entities.
4. Dataflow enables system-wide parallelism, so applications will be designed
to be as embarrassingly parallel as possible.
5. Repeated query patterns can be transformed from request-response polling
to stateful publish-subscribe views driven by dataflow. For example, event driven
applications that perform stream processing will be easily converted to a dataflow model (sketched below).
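As one illustration of guideline 5 (hypothetical names throughout), the sketch below replaces request-response polling with a continuously maintained view that is updated as each event flows through and is pushed to its subscribers:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.function.Consumer;

    // Illustrative continuous view: instead of clients polling for an aggregate,
    // the view is updated as events arrive and pushed to subscribers.
    class RunningTotalView {
        private long total = 0;
        private final List<Consumer<Long>> subscribers = new ArrayList<>();

        void subscribe(Consumer<Long> subscriber) {
            subscribers.add(subscriber);
            subscriber.accept(total);            // deliver current state on subscription
        }

        // Called once per incoming event in the dataflow; no polling anywhere.
        void onEvent(long amount) {
            total += amount;
            for (Consumer<Long> s : subscribers) s.accept(total);
        }
    }

    class PublishSubscribeDemo {
        public static void main(String[] args) {
            RunningTotalView view = new RunningTotalView();
            view.subscribe(total -> System.out.println("dashboard sees total = " + total));

            view.onEvent(100);   // each event pushes a fresh value to subscribers
            view.onEvent(250);
        }
    }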
Given these high level system architecture trends for 2008, we see the following
concrete impacts on the market for extreme transaction processing services and products:
*Move away from traditional disk-based OLTP databases in favor of scalable
memory-based systems.
*Merger of OLTP and OLAP solutions. Real-time OLAP systems will emerge that
provide cube views that are updated as underlying OLTP tables change (see the sketch after this list).
*Rise of cloud computing offerings. In addition to ASP offerings like Amazon
S3 and EC2 as well as IBM's recent cloud computing announcement, we anticipate
traditional database vendors will jump on the bandwagon, realize that databases
can be delivered as software-as-a-service, and provide their own cloud computing
offerings. Smaller data vendor players might get into the cloud computing game
too, creating commoditized Google-like infrastructure based on Yahoo's open
source Hadoop.
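One simplified way to picture the real-time OLAP idea in the second bullet above (hypothetical code, drastically reduced to a single measure and two dimensions): aggregate cells are updated in place as each OLTP change arrives rather than being rebuilt in a batch:

    import java.util.HashMap;
    import java.util.Map;

    // Greatly simplified "cube": one measure (sales total) grouped by two
    // dimensions (region, product), updated incrementally per OLTP change.
    class RealTimeCube {
        private final Map<String, Double> cells = new HashMap<>();

        // Apply a single OLTP change to the affected cube cell immediately.
        void onSale(String region, String product, double amount) {
            cells.merge(region + "|" + product, amount, Double::sum);
        }

        double total(String region, String product) {
            return cells.getOrDefault(region + "|" + product, 0.0);
        }

        public static void main(String[] args) {
            RealTimeCube cube = new RealTimeCube();
            cube.onSale("EMEA", "bonds", 1_000.0);
            cube.onSale("EMEA", "bonds", 2_500.0);
            System.out.println("EMEA/bonds = " + cube.total("EMEA", "bonds")); // 3500.0
        }
    }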
GemStone Systems is an enterprise software company that has pioneered the adoption of several enterprise infrastructure technologies over its history. GemStone's products continue to be used by over 200 large enterprise customers in mission-critical, enterprise-wide deployments in industries such as financial services, the Federal Government, transportation, telecommunications, and energy.
GemStone Systems has built deep expertise in enterprise-level technologies such as distributed resource management, in-memory caching and disk persistence, scalable data distribution, high-performance computing operations, object management, and other areas at the core of building a highly reliable, mission-critical data infrastructure. GemStone brings together a unique set of skills and talented people who have pioneered these technologies throughout its history and who have successfully driven the creation of markets and products at other successful enterprise software companies. GemStone has also built enterprise-class support models that allow global enterprises to rely on GemStone as a solid business partner, with experience dating back many years.