As the amount and types of data that constitute an organization's information
increase exponentially, it becomes more difficult to manage and protect that
data. Data deduplication technology, which is maturing in the backup space and
emerging in nearline and primary storage solutions, is becoming a necessity for
any organization managing large amounts of data. But why is data deduplication
so critical? Simple: it has the potential to change the economics of data management
and data protection solutions by greatly reducing the costs of storing and transferring data.
An IT manager drawn to the promise of deduplication will want to know how data
deduplication delivers disruptive benefits, how it compares to other reduction
technologies, what its limitations are and what customers must know when deploying it.
Deduplication compared to other data reduction technologies
First-generation data reduction technologies such as data compression and file
single instancing have been available for some time and have been widely adopted.
Although these technologies can reduce data footprint and transfer costs, these
gains are modest, usually in the range of 2-8X.
The goal of most compression schemes is to reduce the footprint of a single
file by removing redundancies and efficiently encoding the results. Little is
done to address duplicate or versioned files. Emerging application-aware compression
technologies can deliver significantly improved reduction rates for narrow classes
of data, but often with significant performance penalties and little or no improvement
in handling cross-file redundancies.
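To make the cross-file limitation concrete, here is a minimal Python sketch. It assumes a hypothetical repository holding the same document under two names and uses the standard `zlib` module as a stand-in for a generic per-file compression scheme; the file names and contents are invented for illustration.

```python
import zlib

# Hypothetical repository: the same document saved under two names.
document = b"quarterly report: revenue, costs, forecast\n" * 500
files = {"report.txt": document, "report_copy.txt": document}

# Per-file compression shrinks each file on its own ...
compressed = {name: zlib.compress(data) for name, data in files.items()}
raw_total = sum(len(data) for data in files.values())
compressed_total = sum(len(blob) for blob in compressed.values())

# ... but the duplicate file is still compressed and stored a second time.
print("raw bytes:", raw_total)
print("compressed bytes:", compressed_total)
print("one compressed copy:", len(zlib.compress(document)))
```

Each file shrinks, yet the stored total is exactly twice the size of one compressed copy, because nothing in the scheme looks across file boundaries.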
The goal of file single instance storage (SIS) technologies is to reduce the
data footprints of a repository by eliminating redundant files and replacing
them with references. But file level SIS does little to reduce the footprint
of a single file and doesn't handle sub-file redundancies. Block level SIS technologies
improve reduction ratios by eliminating duplicate fixed length blocks across
files in a repository, but results are still limited. File level changes that
affect data alignment (e.g. byte insertions) typically defeat fixed block schemes.
Related, redundant data in dissimilar files are rarely suitably block aligned,
further limiting block level SIS technologies, making them a poor choice for
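The alignment problem with fixed length blocks can be illustrated with a short Python sketch. The 64-byte block size, the random test data and the helper names are assumptions made for this example only, and SHA-256 digests stand in for whatever block identity a real product uses.

```python
import hashlib
import random

BLOCK_SIZE = 64  # assumed block length, chosen only for illustration

def fixed_blocks(data: bytes, size: int = BLOCK_SIZE) -> list[bytes]:
    """Split data into fixed-length blocks (the last one may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def block_digests(data: bytes) -> set[str]:
    """Identify each block by its SHA-256 digest."""
    return {hashlib.sha256(block).hexdigest() for block in fixed_blocks(data)}

rng = random.Random(0)
original = bytes(rng.randrange(256) for _ in range(BLOCK_SIZE * 100))
appended = original + b"new tail data"  # existing block alignment preserved
inserted = b"X" + original              # one byte inserted at the front

base = block_digests(original)
shared_after_append = len(base & block_digests(appended))
shared_after_insert = len(base & block_digests(inserted))

print("blocks in original:", len(base))
print("still matching after append:", shared_after_append)
print("still matching after 1-byte insert:", shared_after_insert)
```

Appending data leaves every existing block intact, but a single inserted byte shifts every block boundary, so none of the original blocks is recognized as a duplicate.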
First-generation deduplication technology
First-generation variable deduplication technologies address the alignment
problems inherent in file and block level SIS. Variable technologies analyze
incoming data at the sub-file level to assign natural boundaries that vary based
on content. This allows for better redundancy detection granularity and relieves
alignment constraints across files. Consequently, variable schemes have the
potential to adapt better to varied workloads, eliminating redundant data equally
well whether within a file or across similar (versioned) or dissimilar files.
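One common way to assign content-based boundaries is a rolling hash over the incoming bytes. The Python sketch below uses a simple gear-style hash; the article does not name a specific algorithm, so the hash, the lookup table, the boundary mask and the function names are all assumptions for illustration, not a description of any vendor's method.

```python
import hashlib
import random

_table_rng = random.Random(1)
GEAR = [_table_rng.getrandbits(32) for _ in range(256)]  # assumed per-byte table
BOUNDARY_MASK = 0x3F  # assumed: a boundary roughly every 64 bytes on random data

def variable_chunks(data: bytes) -> list[bytes]:
    """Cut a chunk wherever the low bits of the rolling hash are all zero."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        # The masked low bits of h depend only on the last few bytes seen,
        # so boundaries follow the content rather than the byte offset.
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
        if h & BOUNDARY_MASK == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks

def chunk_digests(data: bytes) -> set[str]:
    return {hashlib.sha256(c).hexdigest() for c in variable_chunks(data)}

rng = random.Random(0)
original = bytes(rng.randrange(256) for _ in range(6400))
inserted = b"X" + original  # one byte inserted at the front

base = chunk_digests(original)
shared = len(base & chunk_digests(inserted))
print("chunks in original:", len(base))
print("still matching after 1-byte insert:", shared)
```

Because each boundary decision looks only at a short window of recent bytes, an insertion disturbs the chunks next to it and the rest of the stream realigns on its own. Production schemes typically add minimum and maximum chunk sizes and a wider hash window, which this sketch omits.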
The alignment advantages inherent in variable schemes also tend to scale better
than competing technologies as the volume of data in a repository grows. In
particular, as new and potentially dissimilar data arrives, the probability
of unaligned redundancies increases, allowing for extra levels of reduction.
Variable schemes come with a cost: evaluating data streams and managing the
persistence of variable extents can eat up compute resources. Successful implementations
must be sized appropriately, requiring a careful balance between hardware and
software to maintain reasonable levels of performance.
Second-generation deduplication technologies
As variable deduplication technology gained a foothold in many IT centers,
early implementations worked best when applied to unstructured "copy-oriented"
workloads such as backup. These workloads are naturally redundant and, thanks
to their file-at-a-time sequential I/O patterns, represented a relatively easy target
for balancing performance and reduction costs. But applying this technology
to a wider spectrum of data sets has been less successful to date.
To address this, newer implementations are becoming increasingly adept at handling
more varied workloads and structured data sets while keeping the performance
costs in check. These improvements are fueling the incremental expansion of
deduplication into the archive and nearline markets, and will eventually set
the stage for expansion into key parts of the primary market.
Deduplication implementations are also becoming more aware of application-specific
data layouts to improve reduction ratios. This awareness moves past fast algorithmic
recognition of repeating patterns and more towards understanding application-specific
data formats. Some implementations require full format awareness, limiting their
effectiveness to supported data sets. Others attempt to recognize known formats
so as to amplify reduction rates. Either way, significant improvements in reduction
rates are achieved, widening the gap with other technologies.
Compression-like capabilities are also being increasingly integrated into deduplication
implementations. Commonly, first-generation implementations were layered on
top of standard compression schemes to reduce the final footprint of unique
data. But this "second pass" over the data simply added to the overall
compute costs, limiting performance and requiring more complex and expensive
hardware solutions. Newer schemes are merging compression-like methodology directly
into the deduplication process, avoiding costly second passes and yielding better
performance with lower hardware costs.
The benefit of reducing data transfer volumes through deduplication
Although deduplication technology can significantly reduce the cost of storing
data, the potentially larger value comes from its ability to reduce the volume
of data that needs to be transferred for data protection and data management.
Deduplication technologies enable data transport layers to negotiate the smallest
set of unique data that needs to be pushed between one or more systems, often
yielding massive reductions in site synchronization and physical transfer costs.
These reductions enable far larger data sets to be shipped between
sites, effectively multiplying bandwidth.
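The negotiation described above can be sketched in a few lines of Python. The `Site` class, the `synchronize` function and the three-step exchange (offer digests, learn which are missing, send only those chunks) are hypothetical constructs for illustration; they do not model any particular product's replication protocol, and the byte count ignores the small cost of exchanging the digests themselves.

```python
import hashlib

def digest(chunk: bytes) -> str:
    return hashlib.sha256(chunk).hexdigest()

class Site:
    """A hypothetical site that stores chunks keyed by digest."""

    def __init__(self) -> None:
        self.chunks: dict[str, bytes] = {}

    def missing(self, digests: list[str]) -> set[str]:
        """Report which of the offered digests are not stored locally."""
        return {d for d in digests if d not in self.chunks}

    def receive(self, chunks: list[bytes]) -> None:
        for chunk in chunks:
            self.chunks[digest(chunk)] = chunk

def synchronize(chunks: list[bytes], remote: Site) -> int:
    """Send only the chunks the remote lacks; return payload bytes sent."""
    by_digest = {digest(c): c for c in chunks}
    wanted = remote.missing(list(by_digest))
    to_send = [by_digest[d] for d in by_digest if d in wanted]
    remote.receive(to_send)
    return sum(len(c) for c in to_send)

remote = Site()
monday = [b"A" * 1000, b"B" * 1000, b"C" * 1000]
tuesday = [b"A" * 1000, b"B" * 1000, b"D" * 1000]  # one chunk changed

first_sync = synchronize(monday, remote)
second_sync = synchronize(tuesday, remote)
print("first sync payload:", first_sync)
print("second sync payload:", second_sync)
```

The second synchronization moves only the one chunk the remote site has never seen, which is the effect the article describes as multiplying bandwidth.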
Variable deduplication schemes significantly enhance these transfer benefits:
as data moves between systems, the ability to handle redundant files is amplified
by the ability to handle sub-file, unaligned duplicate data, yielding transfer
reduction ratios comparable to those seen when storing data.
Understanding deduplication variants
All data reduction technologies fundamentally "balance" performance
costs against reduction rates and scope for targeted workloads and markets.
Although variable deduplication offers disruptive benefits, differing implementations
do so in different ways, at different times and with different costs. Thus,
it's helpful to understand the high-level process variations.
Inline, post-process and adaptive deduplication
Inline variants process incoming data in memory, removing all redundancies
so only unique data blocks and the relevant maps are written to disk. When data
rates are limited, redundancies exist and overall data set size is contained,
disk throughput is used efficiently, available storage space is maximized and
tasks like replication begin immediately. But to realize these benefits, inline
schemes must impose a tight processing budget, and when the budget can't be
met, those inline schemes must throttle back the incoming data stream. Consequently,
the maximum data rate inline schemes can support is directly limited by the
availability and cost of compute and memory resources, so they tend to fit best
in low- to mid-range IP-based environments where bandwidth is limited, throttling
is accepted and data set growth is contained.
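As a rough illustration of the inline write path, the Python sketch below keeps only unique blocks and per-file maps. The `InlineStore` class and its dictionaries standing in for disk are assumptions for this example; the processing budget and throttling behavior described above are deliberately left out.

```python
import hashlib

class InlineStore:
    """Hypothetical inline store: deduplicate in memory, write only unique data."""

    def __init__(self) -> None:
        self.disk_blocks: dict[str, bytes] = {}    # digest -> unique block
        self.file_maps: dict[str, list[str]] = {}  # file name -> ordered digests
        self.bytes_written = 0

    def ingest(self, name: str, blocks: list[bytes]) -> None:
        recipe = []
        for block in blocks:
            d = hashlib.sha256(block).hexdigest()
            if d not in self.disk_blocks:
                # Only blocks never seen before reach the disk.
                self.disk_blocks[d] = block
                self.bytes_written += len(block)
            recipe.append(d)
        self.file_maps[name] = recipe

    def read(self, name: str) -> bytes:
        """Rebuild a file from its map of block digests."""
        return b"".join(self.disk_blocks[d] for d in self.file_maps[name])

store = InlineStore()
payload = [b"x" * 10, b"y" * 10, b"x" * 10]
store.ingest("backup-1", payload)
print("bytes ingested:", sum(len(b) for b in payload))
print("bytes written:", store.bytes_written)
```

Every incoming block is hashed before anything is written, which is why the hashing and lookup cost sits directly in the ingest path.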
Post-process variants favor performance by deferring deduplication processing.
Typically, incoming data is written directly to disk at peak rates and when
the ingest process completes, data is revisited and reduced. Post-process schemes
effectively remove the compute and memory resource costs from ingest and can
avoid throttling, but at the cost of delayed reduction and replication, as well
as temporary surges in capacity. Post-process variants fit best in higher-end
SAN-based environments where top throughput is at a premium. Vendors also leverage
a key side effect of post-processing: since the data sets most recently ingested
tend to be the ones read back most frequently, access to the spooled copies
offers a significant boost in retrieval performance, and these redundant copies
are often retained until capacity limits are reached.
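The temporary capacity surge of a post-process scheme shows up clearly in a small Python sketch. The `PostProcessStore` class is hypothetical, and for simplicity it discards spooled copies as soon as reduction finishes rather than retaining them until capacity limits are reached.

```python
import hashlib

class PostProcessStore:
    """Hypothetical post-process store: land data first, reduce it later."""

    def __init__(self) -> None:
        self.spool: dict[str, list[bytes]] = {}    # raw data as ingested
        self.unique_blocks: dict[str, bytes] = {}  # digest -> unique block
        self.file_maps: dict[str, list[str]] = {}  # file name -> ordered digests

    def ingest(self, name: str, blocks: list[bytes]) -> None:
        # No hashing on the ingest path: data goes straight to the spool.
        self.spool[name] = list(blocks)

    def reduce(self) -> None:
        """Deferred pass: revisit spooled data and keep only unique blocks."""
        for name, blocks in self.spool.items():
            recipe = []
            for block in blocks:
                d = hashlib.sha256(block).hexdigest()
                self.unique_blocks.setdefault(d, block)
                recipe.append(d)
            self.file_maps[name] = recipe
        self.spool.clear()  # simplification: spooled copies are dropped at once

    def capacity_used(self) -> int:
        spooled = sum(len(b) for blocks in self.spool.values() for b in blocks)
        unique = sum(len(b) for b in self.unique_blocks.values())
        return spooled + unique

store = PostProcessStore()
store.ingest("monday", [b"a" * 10, b"b" * 10])
store.ingest("tuesday", [b"a" * 10, b"c" * 10])
before = store.capacity_used()
store.reduce()
after = store.capacity_used()
print("capacity before reduction:", before)
print("capacity after reduction:", after)
```

Ingest does no deduplication work at all, so the full raw volume occupies disk until the deferred pass runs.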
Hybrid or adaptive variants attempt to blend the benefits of inline and post-process
schemes while minimizing the drawbacks. As data is ingested at low or medium
data rates, deduplication can occur largely inline, but when rates surge, these
schemes avoid throttling by spooling data to disk, effectively creating a second
level cache. Deduplication occurs in parallel, pulling data from the disk in
first-in, first-out order.
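A simplified model of this behavior follows, written in Python. The `AdaptiveStore` class, the notion of a per-tick `inline_budget` and the single-threaded simulation of what would really be parallel work are all assumptions introduced for this sketch.

```python
import hashlib
from collections import deque

class AdaptiveStore:
    """Hypothetical adaptive store: inline when possible, spool during surges."""

    def __init__(self, inline_budget: int) -> None:
        # Assumed: number of blocks that can be deduplicated per time slice.
        self.inline_budget = inline_budget
        self.spool: deque[bytes] = deque()  # FIFO second-level cache on disk
        self.unique_blocks: dict[str, bytes] = {}

    def _dedupe(self, block: bytes) -> None:
        self.unique_blocks.setdefault(hashlib.sha256(block).hexdigest(), block)

    def tick(self, incoming: list[bytes]) -> None:
        """Process one time slice of incoming blocks."""
        budget = self.inline_budget
        # Background path: drain spooled blocks, oldest first.
        while self.spool and budget > 0:
            self._dedupe(self.spool.popleft())
            budget -= 1
        for block in incoming:
            if budget > 0 and not self.spool:
                self._dedupe(block)       # inline path
                budget -= 1
            else:
                self.spool.append(block)  # surge: spool instead of throttling

store = AdaptiveStore(inline_budget=2)
store.tick([b"a", b"b"])              # normal rate: handled inline
store.tick([b"a", b"c", b"d", b"e"])  # surge: two blocks are spooled
spooled_during_surge = len(store.spool)
store.tick([])                        # quiet period: the spool drains
print("spooled during surge:", spooled_during_surge)
print("spooled after quiet tick:", len(store.spool))
print("unique blocks:", len(store.unique_blocks))
```

The ingest side never waits: anything beyond the budget lands in the spool and is deduplicated later in arrival order.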
The big advantage of adaptive schemes is that they can support low- and mid-range
IP-based and higher-end SAN environments equally well, and require little tuning
to adapt to expectations. More importantly, they give customers great latitude
in deployment, more options for tuning and generally enhanced policy controls.
Client-side or target-side deduplication?
Client-side deduplication variants perform all deduplication processing on
the client system before transmitting data to the target system. Target-side
deduplication variants perform all deduplication processing after data arrives
on the target system.
One of the key benefits of target-side deduplication is that client systems
are not affected by the deduplication processing. The target system is typically
provisioned to properly support deduplication in the capacity and performance
range that the system was designed to support. But most important, target-side
variants are designed to transparently integrate into the existing data storage
environment. These systems hide deduplication behind typical industry standard
protocols, so no proprietary transports or costly software switch-outs are required.
But transparency comes with a cost: target-side variants do little to reduce
the volume of data that flows between the client and target systems.
The principal benefit of client-side deduplication variants is that they reduce
the volume of data flowing to the target system. Client-side variants aggregate
the processing power of multiple clients, allowing the use of less costly target
storage systems. However, these benefits come with costs that often prohibit
cost-effective deployment. Client-side deduplication can easily over-stress
client systems and often significantly degrades application performance. In
addition, proprietary client software and transfer protocols can force time-consuming
and costly large-scale software infrastructure switch-outs.
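The trade-off between the two placements can be sketched by counting the bytes that cross the wire in each case. The two Python functions below are hypothetical; in particular, the client-side version consults the target's index directly, whereas a real deployment would need a lookup protocol between client and target, which is where the proprietary transports mentioned above come in.

```python
import hashlib

def target_side_bytes_sent(blocks: list[bytes], target_index: set[bytes]) -> int:
    """Target-side: every block crosses the wire, then the target deduplicates."""
    sent = 0
    for block in blocks:
        sent += len(block)
        target_index.add(hashlib.sha256(block).digest())  # hashing on the target
    return sent

def client_side_bytes_sent(blocks: list[bytes], target_index: set[bytes]) -> int:
    """Client-side: the client hashes first and sends only unknown blocks."""
    sent = 0
    for block in blocks:
        d = hashlib.sha256(block).digest()  # hashing cost lands on the client
        if d not in target_index:           # assumed lookup against the target
            sent += len(block)
            target_index.add(d)
    return sent

backup = [b"a" * 100, b"b" * 100, b"a" * 100, b"a" * 100]

target_index: set[bytes] = set()
client_index: set[bytes] = set()
target_sent = target_side_bytes_sent(backup, target_index)
client_sent = client_side_bytes_sent(backup, client_index)
print("target-side bytes on the wire:", target_sent)
print("client-side bytes on the wire:", client_sent)
```

Both variants end up storing the same two unique blocks. They differ in where the hashing work happens and in how much data has to travel to get there.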
What does this mean to the customer?
Data reduction technology is powerful, is maturing at a fast rate and is a critical
technology for all storage and data protection/disaster recovery solutions going
forward by virtue of its ability to reduce the footprint and transfer costs
inherent in managing huge amounts of data. However, not all reduction technologies
offer similar benefits. Deduplication technology is at the forefront, blending
many of the benefits of the other technologies, but customers must carefully
balance data set awareness with reduction expectations and performance requirements.
About the Author
Jeffrey Tofano is a recognized industry veteran, having worked in storage and data protection for more than 25 years. As Quantum's chief technology officer, he oversees the company's technological vision and portfolio roadmaps with an emphasis on integrating data de-duplication and file system technologies into the company's broader solutions strategy, as well as protecting critical business data in a virtual world.
Previously, he served as Technical Director at NetApp and Chief Architect at OnStor. Since joining Quantum in 2007, Tofano has spoken on various topics covering emerging data protection concepts and technologies for audiences at data protection and storage network conferences. These include Storage Networking Industry Association (SNIA) events, TechTarget Backup School end-user tutorials, and national conferences such as Storage Network World (SNW) and Symantec Vision. Tofano can be reached at firstname.lastname@example.org.