Data Dedupe Designed Specifically for Large Enterprises

Enterprise IT managers have several data dedupe technologies to choose from in their efforts to control exponential data growth and reduce capacity requirements. While some are better for small-to-medium businesses, others are specifically designed for enterprise data dedupe requirements. Consider the strengths and drawbacks of each of these technologies to choose the solution that best meets your needs.  

Not all data dedupe technologies are alike.

Deduplication technologies reduce capacity by comparing a given data set to a baseline data set at the sub-file or byte level, identifying duplicate data, and replacing it with a pointer to the baseline occurrence. Many use a technique called “hashing” to describe incoming data and store the hash values in an index when data is written to disk. Data dedupe may be performed at the source or at the target. At the target, it may be performed inline, before data is written to disk, or post-process, after data is written to disk. Each of these technologies is described below.

Table 1: Comparison of Data Dedupe Requirements

                        Enterprise                                SME                           SMB
Data under protection   >100TB                                    <100TB                        <5TB
Nightly Backup          >10TB                                     <3TB                          <500GB
Data Change Rate        >5%                                       2%-5%                         <1%
Guarantee data safety   Yes                                       Maybe                         Don’t care
Data Type               Databases, Email, Office, Specialty apps  Email, Office, Some Database  Email, Office
Retention Period        Compliance driven                         Compliance driven             Don’t care
Best Suited to Address  SEPATON                                   SEPATON                       Inline data dedupe appliances

Data dedupe at the source.

Source data dedupe is typically performed at the backup media server to reduce the amount of data sent to the target, or on individual client systems (desktops and servers) before data is sent to media servers. These solutions dedupe chunks of data rather than entire files. For SMB data volumes, it is an efficient way to reduce capacity. However, like all hash-based approaches, its ingest performance is variable: higher change rates result in lower backup performance and higher network loads.

Although source data dedupe is a feature of most major enterprise backup applications, there are a number of reasons not to use it in enterprise data centers. For example, it adds CAPEX and complexity by requiring enterprises to buy larger servers or redistribute server workloads to compensate for the significant extra processing involved. It can also cause lock-in between the backup application and the target.

Data dedupe at the target.

There are two approaches to data dedupe on the target system: the hash-based approach, which performs data dedupe inline before data is written to disk, and the post-process approach, which uses a processing pipeline to give priority to ingest over data dedupe. Data is typically compressed after data dedupe to achieve additional storage savings.

Hash-based data dedupe.

Hash-based data dedupe starts by analyzing segments of data as they are backed up and assigning each segment a unique identifier called a hash. Most technologies use an algorithm that computes a cryptographic hash value from a fixed or variable segment of data in the backup stream, independent of the data type.

The hash value is compared to an index of all hash identifiers. If the hash value is not already in the index, the data it describes is new and it is written to disk. If the hash value already exists in the index, then the data it describes is a duplicate of data that has already been stored. The duplicate data is replaced with a pointer to the unique data record.
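The lookup described above can be sketched in a few lines of Python. This is a toy, in-memory illustration, not any vendor's implementation: the class name `DedupeStore`, the fixed 8 KB segment size, and the choice of SHA-256 are all assumptions made for the example.

```python
import hashlib

SEGMENT_SIZE = 8 * 1024  # assumed fixed segment size (8 KB)


class DedupeStore:
    """Hypothetical in-memory hash-based dedupe store (illustrative only)."""

    def __init__(self):
        self.index = {}  # hash value -> the unique segment it describes

    def write(self, stream: bytes) -> list:
        """Store a backup stream and return its pointers (hash values)."""
        pointers = []
        for offset in range(0, len(stream), SEGMENT_SIZE):
            segment = stream[offset:offset + SEGMENT_SIZE]
            digest = hashlib.sha256(segment).hexdigest()
            if digest not in self.index:
                # Hash not in the index: the data is new, so store it.
                self.index[digest] = segment
            # New or duplicate, the backup only keeps a pointer.
            pointers.append(digest)
        return pointers

    def read(self, pointers: list) -> bytes:
        """Reconstitute a backup by following its pointers."""
        return b"".join(self.index[p] for p in pointers)

    def stored_bytes(self) -> int:
        """Bytes of unique segment data actually kept."""
        return sum(len(seg) for seg in self.index.values())


store = DedupeStore()
monday = b"A" * SEGMENT_SIZE + b"B" * SEGMENT_SIZE
tuesday = b"A" * SEGMENT_SIZE + b"C" * SEGMENT_SIZE  # one segment changed
monday_ptrs = store.write(monday)
tuesday_ptrs = store.write(tuesday)
```

After both writes, the store holds three unique segments rather than four, and either backup can be rebuilt with `store.read(...)`. A production system would persist the index and the segments on disk, which is where the lookup cost discussed in the text comes from.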

The benefit of this approach is that in a steady-state condition where there is a small rate of change between successive backups (typically < 5 percent), only the new data is written to disk. However, the performance of hash-based systems is constrained by both the index lookup speed and the storage subsystem, which is optimized for a relatively low write throughput.

Inline data dedupe, whether performed at the source or target, is not recommended for large data volumes for a number of reasons.

  • No performance scalability. All backup, data dedupe, restore, and replication processing has to be performed on a single node. Data is not deduplicated among the individual instances.
  • Ingest speed slows over time. Hash calculations and lookups take time, slowing data ingest (backup performance). This bottleneck worsens over time as the index table grows.
  • Poor restore times. Restores slow over time because data has to be located and reconstituted from more and more data dedupe pointers before the restore process can begin.
  • Inefficient database dedupe. Hash-based systems cannot examine segments of data that are smaller than 8 KB without bringing their performance to a crawl. Since most business-critical enterprise databases, such as SAP, Oracle, and SQL Server, store data in segments smaller than 8 KB, hash/inline data dedupe does not eliminate a significant percentage of duplicate data.
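A rough calculation shows why index growth and small segment sizes matter. The sketch below assumes a worst case in which every segment is unique, one 32-byte digest per index entry, and no other per-entry metadata; the function names and these figures are illustrative assumptions, not measurements of any product.

```python
TIB = 2**40
GIB = 2**30
KIB = 2**10


def index_entries(data_bytes: int, segment_bytes: int) -> int:
    """Number of index entries if every segment is unique (ceiling division)."""
    return -(-data_bytes // segment_bytes)


def index_size_bytes(data_bytes: int, segment_bytes: int,
                     digest_bytes: int = 32) -> int:
    """Index size, counting only an assumed 32-byte digest per entry."""
    return index_entries(data_bytes, segment_bytes) * digest_bytes


protected = 100 * TIB  # the enterprise tier in Table 1 is >100TB
for seg_kib in (8, 4, 2):
    size = index_size_bytes(protected, seg_kib * KIB)
    print(f"{seg_kib} KB segments: {size // GIB} GiB of digests")
```

Under these assumptions, halving the segment size doubles the number of entries that must be hashed, stored, and searched, which is consistent with the performance penalty the text describes for sub-8 KB segments.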


The power of concurrent deduplication processing.

The goal is to ingest and protect backup data at the maximum possible speed, minimizing the “time to safety”. Some systems back up data to disk and then perform data dedupe in discrete steps. While this improves the backup time, it slows completion of the data dedupe, replication, and restore processes.

In contrast, SEPATON’s DeltaStor data dedupe performs concurrent processing by load balancing backup, data dedupe, replication, and restore operations across as many as eight nodes. As a result, each of these processes can be performed concurrently for an optimal balance of time to safety and capacity reduction. SEPATON meets the world’s most demanding data protection SLA, ROI, and TCO requirements.

DeltaStor data dedupe offers a wide range of advantages for enterprise data centers, including:

  • Fastest time to safety. Data is moved to the safety of the data protection platform at the fastest ingest rates in the industry, without being slowed by deduplication processing. Backup, deduplication, replication, and restore processing are performed concurrently and load balanced across all available processing nodes.
  • Most efficient capacity reduction. DeltaStor can compare data at the byte level and at the sub-8 KB level for maximum capacity reduction, even in database and “incremental forever” environments.
  • Fastest time to restore. DeltaStor uses the most recent backup as the baseline and deduplicates older data against it so it can be restored instantly.
  • Bandwidth optimized replication. DeltaStor works in conjunction with DeltaRemote replication software to reduce the amount of data that needs to be replicated by as much as 97 percent.
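
The idea of keeping the most recent backup as the baseline can be illustrated with a generic reverse-delta layout. The source does not describe DeltaStor's internals, so the sketch below is only an analogy: the class `ReverseDeltaStore`, the tiny 4-byte segment size, and the segment-by-segment comparison are all hypothetical choices made for readability.

```python
SEGMENT = 4  # tiny hypothetical segment size, for readability only


def split(data: bytes) -> list:
    """Cut a backup into fixed-size segments."""
    return [data[i:i + SEGMENT] for i in range(0, len(data), SEGMENT)]


class ReverseDeltaStore:
    """Keeps the newest backup whole; older backups become deltas against it."""

    def __init__(self):
        self.latest = None  # segments of the newest backup, stored in full
        self.deltas = []    # (segment_count, {position: old_segment}), oldest first

    @staticmethod
    def _delta(old: list, base: list) -> tuple:
        # Keep only the segments of `old` that differ from the newer `base`.
        changed = {i: seg for i, seg in enumerate(old)
                   if i >= len(base) or base[i] != seg}
        return (len(old), changed)

    def ingest(self, data: bytes) -> None:
        new = split(data)
        if self.latest is not None:
            # The previous baseline is re-expressed against the new one.
            self.deltas.append(self._delta(self.latest, new))
        self.latest = new

    def restore(self, age: int = 0) -> bytes:
        """age=0 is the newest backup, age=1 the one before it, and so on."""
        segs = self.latest
        for k in range(1, age + 1):
            count, changed = self.deltas[-k]
            segs = [changed.get(i, segs[i] if i < len(segs) else None)
                    for i in range(count)]
        return b"".join(segs)


store = ReverseDeltaStore()
for backup in (b"aaaabbbbcccc", b"aaaaBBBBcccc", b"aaaaBBBBccccdddd"):
    store.ingest(backup)
```

In this layout `store.restore(0)` joins the stored segments directly with no pointers to resolve, while older backups cost one delta application per generation. That trade-off favors the restore that is requested most often, the latest one.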