The Business Benefits of Deduplication

As the amount and variety of data that constitutes an organization's information grows exponentially, that data becomes more difficult to manage and protect. Data deduplication technology, which is maturing in the backup space and emerging in nearline and primary storage solutions, is becoming a necessity for any organization managing large amounts of data. But why is data deduplication so critical? Simple: it has the potential to change the economics of data management and data protection by greatly reducing the costs of storing and transferring data.

An IT manager drawn to the promise of deduplication will want to know how data deduplication delivers disruptive benefits, how it compares to other reduction technologies, what its limitations are and what customers must know when deploying deduplication solutions.

Deduplication compared to other data reduction technologies

First-generation data reduction technologies such as data compression and file single instancing have been available for some time and have been widely adopted. Although these technologies can reduce data footprint and transfer costs, the gains are modest, usually in the range of 2-8X.

The goal of most compression schemes is to reduce the footprint of a single file by removing redundancies and efficiently encoding the results. Little is done to address duplicate or versioned files. Emerging application-aware compression technologies can deliver significantly improved reduction rates for narrow classes of data, but often with significant performance penalties and little or no improvement in handling cross-file redundancies.

The goal of file single instance storage (SIS) technologies is to reduce the data footprint of a repository by eliminating redundant files and replacing them with references. But file-level SIS does little to reduce the footprint of a single file and doesn't handle sub-file redundancies. Block-level SIS technologies improve reduction ratios by eliminating duplicate fixed-length blocks across files in a repository, but results are still limited. File-level changes that affect data alignment (e.g. byte insertions) typically defeat fixed-block schemes. Relatedly, redundant data in dissimilar files is rarely suitably block aligned, further limiting block-level SIS technologies and making them a poor choice for versioned environments.
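The alignment weakness is easy to demonstrate. The sketch below (a toy illustration, not any vendor's implementation) stores fixed-length blocks keyed by their hash; a single-byte insertion shifts every subsequent block boundary, so data that was previously identical no longer deduplicates:

```python
import hashlib

def fixed_block_dedupe(data: bytes, block_size: int = 8):
    """Store each fixed-length block once, keyed by its content hash."""
    store = {}
    for i in range(0, len(data), block_size):
        block = data[i:i + block_size]
        store[hashlib.sha256(block).hexdigest()] = block
    return store

original = b"ABCDEFGH" * 4   # four identical 8-byte blocks
shifted = b"X" + original    # one-byte insertion shifts every block boundary

base = fixed_block_dedupe(original)
combined = fixed_block_dedupe(original + shifted)
print(len(base))      # 1 unique block: perfect deduplication
print(len(combined))  # 4 unique blocks: the shifted copy barely deduplicates
```

Because the shifted copy's blocks straddle the original boundaries, almost none of them match the stored blocks, which is exactly why fixed-block schemes struggle with versioned data.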

First-generation deduplication technology

First-generation variable deduplication technologies address the alignment problems inherent in file- and block-level SIS. Variable technologies analyze incoming data at the sub-file level to assign natural boundaries that vary based on content. This allows for better redundancy detection granularity and relieves alignment constraints across files. Consequently, variable schemes have the potential to adapt better to varied workloads, eliminating redundant data equally well whether within a file or across similar (versioned) or dissimilar files.
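A toy sketch of content-defined chunking shows why these boundaries relieve alignment constraints (the window size and divisor here are illustrative assumptions; production systems use rolling hashes such as Rabin fingerprints over much larger windows):

```python
import hashlib

WINDOW, DIVISOR = 4, 16  # toy parameters; real systems use far larger values

def chunks(data: bytes):
    """Cut a chunk boundary wherever the hash of the trailing window
    hits a target value, so boundaries depend only on local content."""
    out, start = [], 0
    for i in range(WINDOW, len(data) + 1):
        window_hash = int.from_bytes(
            hashlib.sha256(data[i - WINDOW:i]).digest()[:4], "big")
        if window_hash % DIVISOR == 0:
            out.append(data[start:i])
            start = i
    if start < len(data):
        out.append(data[start:])  # trailing partial chunk
    return out

base = b"the quick brown fox jumps over the lazy dog " * 3
edited = b"X" + base  # one-byte insertion at the front

# Boundaries re-synchronize right after the edit, so every chunk of
# `base` except possibly the first reappears verbatim in `edited`.
print(len(set(chunks(base)) & set(chunks(edited))))
```

Contrast this with the fixed-block case: here an insertion perturbs only the chunks that overlap it, and everything downstream still deduplicates.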

The alignment advantages inherent in variable schemes also tend to scale better than competing technologies as the volume of data in a repository grows. In particular, as new and potentially dissimilar data arrives, the probability of unaligned redundancies increases, allowing for extra levels of reduction.

Variable schemes come with a cost: evaluating data streams and managing the persistence of variable extents can eat up compute resources. Successful implementations must be sized appropriately, requiring a careful balance between hardware and software to maintain reasonable levels of performance.

Second-generation deduplication technologies

As variable deduplication technology gained a foothold in many IT centers, early implementations worked best when applied to unstructured "copy-oriented" workloads such as backup. These workloads are naturally redundant and, thanks to their file-at-a-time sequential I/O patterns, represented a relatively easy target for balancing performance and reduction costs. But extending this technology to a wider spectrum of data sets has been less successful to date.

To address this, newer implementations are becoming increasingly adept at handling more varied workloads and structured data sets while keeping the performance costs in check. These improvements are fueling the incremental expansion of deduplication into the archive and nearline markets, and will eventually set the stage for expansion into key parts of the primary market.

Deduplication implementations are also becoming more aware of application-specific data layouts to improve reduction ratios. This awareness moves beyond fast algorithmic recognition of repeating patterns towards understanding application-specific data formats. Some implementations require full format awareness, limiting their effectiveness to supported data sets. Others attempt to recognize known formats so as to amplify reduction rates. Either way, significant improvements in reduction rates are achieved, widening the gap with other technologies.

Compression-like capabilities are also being increasingly integrated into deduplication implementations. Commonly, first-generation implementations were layered on top of standard compression schemes to reduce the final footprint of unique data. But this "second pass" over the data simply added to the overall compute costs, limiting performance and requiring more complex and expensive hardware solutions. Newer schemes merge compression-like methodology directly into the deduplication process, avoiding costly second passes and yielding better performance with lower hardware costs.
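The integrated approach can be sketched as compressing each unique chunk once, at the moment it is admitted to the store, rather than making a separate pass over the data afterwards (a minimal illustration, not any vendor's design):

```python
import hashlib
import zlib

class DedupeStore:
    """Unique chunks are compressed once, as they are admitted."""
    def __init__(self):
        self.blocks = {}  # fingerprint -> compressed unique chunk
        self.recipe = []  # ordered fingerprints to rebuild the stream

    def ingest(self, chunk: bytes):
        fp = hashlib.sha256(chunk).hexdigest()
        if fp not in self.blocks:                    # deduplicate first...
            self.blocks[fp] = zlib.compress(chunk)   # ...then compress the survivor
        self.recipe.append(fp)

    def restore(self) -> bytes:
        return b"".join(zlib.decompress(self.blocks[fp]) for fp in self.recipe)

store = DedupeStore()
for chunk in (b"hello world" * 10, b"hello world" * 10, b"unique tail"):
    store.ingest(chunk)
print(len(store.blocks))  # 2 unique chunks stored, each compressed exactly once
```

The duplicate chunk is discarded before compression ever runs on it, which is where the second-pass compute cost disappears.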

The benefit of reducing data transfer volumes through deduplication

Although deduplication technology can significantly reduce the cost of storing data, the potentially larger value comes from its ability to reduce the volume of data that needs to be transferred for data protection and data management tasks.

Deduplication technologies enable data transport layers to negotiate the smallest set of unique data that needs to be pushed between systems, often yielding massive reductions in site synchronization and physical transfer costs. These reductions allow far larger data sets to be shipped between sites, effectively multiplying available bandwidth.

Variable deduplication schemes significantly enhance these transfer benefits: as data moves between systems, the ability to handle redundant files is amplified by the ability to handle sub-file, unaligned duplicate data yielding transfer reduction ratios comparable to those seen when storing data.
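A minimal sketch of such a negotiation (hypothetical helper names; real protocols batch and pipeline these exchanges) shows how only unseen chunks cross the wire on subsequent synchronizations:

```python
import hashlib

def fingerprint(chunk: bytes) -> str:
    return hashlib.sha256(chunk).hexdigest()

def negotiate_transfer(client_chunks, server_store) -> int:
    """Client offers fingerprints; server requests only the chunks it
    lacks. Returns the number of payload bytes actually transferred."""
    offered = [fingerprint(c) for c in client_chunks]
    missing = {fp for fp in offered if fp not in server_store}
    sent = 0
    for chunk in client_chunks:
        fp = fingerprint(chunk)
        if fp in missing:
            server_store[fp] = chunk
            sent += len(chunk)
            missing.discard(fp)  # each unique chunk is sent once
    return sent

server = {}
day1 = [b"block-A" * 100, b"block-B" * 100]
day2 = [b"block-A" * 100, b"block-B" * 100, b"block-C" * 100]

print(negotiate_transfer(day1, server))  # 1400: full payload on first sync
print(negotiate_transfer(day2, server))  # 700: only the new chunk is shipped
```

The second sync ships a third of the payload despite the data set growing, which is the "bandwidth multiplying" effect described above.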

Understanding deduplication variants

All data reduction technologies fundamentally "balance" performance costs against reduction rates and scope for targeted workloads and markets. Although variable deduplication offers disruptive benefits, differing implementations do so in different ways, at different times and with different costs. Thus, it's helpful to understand the high-level process variations.

Inline, post-process and adaptive deduplication

Inline variants process incoming data in memory, removing all redundancies so only unique data blocks and the relevant maps are written to disk. When data rates are limited, redundancies exist and overall data set size is contained, disk throughput is used efficiently, available storage space is maximized and tasks like replication begin immediately. But to realize these benefits, inline schemes must impose a tight processing budget, and when the budget can't be met, those inline schemes must throttle back the incoming data stream. Consequently, the maximum data rate inline schemes can support is directly limited by the availability and cost of compute and memory resources, so they tend to fit best in low- to mid-range IP-based environments where bandwidth is limited, throttling is accepted and data set growth is contained.

Post-process variants bias performance by deferring deduplication processing. Typically, incoming data is written directly to disk at peak rates and, when the ingest process completes, the data is revisited and reduced. Post-process schemes effectively remove the compute and memory resource costs from ingest and can avoid throttling, but at the cost of delayed reduction and replication, as well as temporary surges in capacity. Post-process variants fit best in higher-end SAN-based environments where top throughput is at a premium. Vendors also leverage a key side effect of post-processing: since the data sets most recently ingested tend to be the ones read back most frequently, access to the spooled copies offers a significant boost in retrieval performance, and these redundant copies are often retained until capacity limits are reached.

Hybrid or adaptive variants attempt to blend the benefits of inline and post-process schemes while minimizing the drawbacks. As data is ingested at low or medium data rates, deduplication can occur largely inline, but when rates surge, these schemes avoid throttling by spooling data to disk, effectively creating a second level cache. Deduplication occurs in parallel, pulling data from the disk in first-in, first-out order.
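The adaptive behavior can be sketched as a simple two-path ingester (a toy model with an explicit pressure flag; real implementations infer load from queue depths and device throughput rather than being told):

```python
import hashlib
from collections import deque

class AdaptiveIngester:
    """Deduplicate inline when load allows; spool raw data during surges."""
    def __init__(self):
        self.store = {}       # fingerprint -> unique chunk
        self.spool = deque()  # stands in for the on-disk second-level cache

    def _dedupe(self, chunk: bytes):
        self.store.setdefault(hashlib.sha256(chunk).hexdigest(), chunk)

    def ingest(self, chunk: bytes, under_pressure: bool = False):
        if under_pressure:
            self.spool.append(chunk)  # surge path: spool at full speed
        else:
            self._dedupe(chunk)       # calm path: reduce inline

    def drain(self):
        while self.spool:             # background pass, first-in first-out
            self._dedupe(self.spool.popleft())

ing = AdaptiveIngester()
ing.ingest(b"alpha")
ing.ingest(b"alpha", under_pressure=True)  # spooled, not yet reduced
ing.ingest(b"beta", under_pressure=True)
print(len(ing.store), len(ing.spool))      # 1 2: surge data awaits reduction
ing.drain()
print(len(ing.store), len(ing.spool))      # 2 0: spool drained, duplicate folded
```

Ingest never throttles during the surge, and the deferred drain still recovers full deduplication, which is the blend of benefits the adaptive design targets.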

The big advantage of adaptive schemes is that they can support low- and mid-range IP-based and higher-end SAN environments equally well, and require little tuning to meet performance expectations. More importantly, they allow customers great latitude in deployment, more options for tuning and generally enhanced policy controls.

Client-side or target-side deduplication?

Client-side deduplication variants perform all deduplication processing on the client system before transmitting data to the target system. Target-side deduplication variants perform all deduplication processing after data arrives on the target system.

One of the key benefits of target-side deduplication is that client systems are not affected by the deduplication processing. The target system is typically provisioned to properly support deduplication in the capacity and performance range that the system was designed for. Most importantly, target-side variants are designed to integrate transparently into the existing data storage environment. These systems hide deduplication behind typical industry-standard protocols, so no proprietary transports or costly software switch-outs are required. But transparency comes with a cost: target-side variants do little to reduce the volume of data that flows between the client and target systems.

The principal benefit of client-side deduplication variants is that they reduce the volume of data flowing to the target system. Client-side variants aggregate the processing power of multiple clients, allowing the use of less costly target storage systems. However, these benefits come with costs that often prohibit cost-effective deployment. Client-side deduplication can easily over-stress client systems and often significantly degrades application performance. In addition, proprietary client software and transfer protocols can force time-consuming and costly large-scale software infrastructure switch-outs.

What does this mean to the customer?

Data reduction technology is powerful, maturing at a fast rate and, by virtue of its ability to reduce the footprint and transfer costs inherent in managing huge amounts of data, critical to all storage, data protection and disaster recovery solutions going forward. However, not all reduction technologies offer similar benefits. Deduplication technology is at the forefront, blending many of the benefits of the other technologies, but customers must carefully balance data set awareness with reduction expectations and performance requirements.

About the Author

Jeffrey Tofano is a recognized industry veteran, having worked in storage and data protection for more than 25 years. As Quantum's chief technology officer, he oversees the company's technological vision and portfolio roadmaps with an emphasis on integrating data de-duplication and file system technologies into the company's broader solutions strategy, as well as protecting critical business data in a virtual world. Previously, he served as Technical Director at NetApp and Chief Architect at OnStor. Since joining Quantum in 2007, Tofano has spoken on various topics covering emerging data protection concepts and technologies for audiences at data protection and storage network conferences. These include: Storage Networking Industry Association (SNIA) events, Techtarget Backup School end user tutorials, and national conferences such as Storage Network World (SNW) and Symantec Vision. Tofano can be reached at
