Data Dedupe Designed Specifically for Large Enterprises

Enterprise IT managers have several data dedupe technologies to choose from in their efforts to control exponential data growth and reduce capacity requirements. While some are better suited to small-to-medium businesses (SMBs), others are specifically designed for enterprise data dedupe requirements. Consider the strengths and drawbacks of each of these technologies to choose the solution that best meets your needs.

Not all data dedupe technologies are alike.

Deduplication technologies reduce capacity by comparing a given data set to a baseline data set at the sub-file or byte level to identify duplicate data and replace it with a pointer to the baseline occurrence. Many use a technique called "hashing" to describe incoming data, and store the hash values in an index when data is written to disk. Data dedupe may be performed at the source or at the target. At the target, it may be performed inline, before data is written to disk, or as a post-process, after data is written to disk. Each of these technologies is described below.
Data dedupe at the source.

Source data dedupe is typically performed at the backup media server to reduce the amount of data sent to the target, or on individual client systems (desktops and servers) before data is sent to media servers. These solutions dedupe chunks of data rather than entire files. For SMB data volumes, it is an efficient way to reduce capacity. However, like all hash-based approaches, it makes ingest performance variable: higher change rates result in lower backup performance and higher network loads. Although source data dedupe is a feature of most major enterprise backup applications, there are a number of reasons not to use it in enterprise data centers. For example, it adds CAPEX and complexity by requiring enterprises to buy larger servers or redistribute server workloads to compensate for the significant extra workload involved. It can also create lock-in between the backup application and the target.

Data dedupe at the target.

There are two approaches to data dedupe on the target system: the hash-based approach, which performs data dedupe inline before data is written to disk, and the post-process approach, which uses a processing pipeline to give priority to ingest over data dedupe. Data is typically compressed after data dedupe to achieve additional storage savings.

Hash-based data dedupe.

Hash-based data dedupe starts by analyzing segments of data as it is backed up and assigning each segment a unique identifier called a hash. Most technologies use an algorithm that computes a cryptographic hash value from a fixed or variable segment of data in the backup stream, independent of the data type. The hash value is compared to an index of all hash identifiers. If the hash value is not already in the index, the data it describes is new and it is written to disk.
If the hash value already exists in the index, then the data it describes is a duplicate of data that has already been stored. The duplicate data is replaced with a pointer to the unique data record. The benefit of this approach is that in a steady-state condition, where there is a low rate of change between successive backups (typically less than 5 percent), only the new data is written to disk. However, the performance of hash-based systems is constrained by both the index lookup speed and the storage subsystem, which is optimized for a relatively low write throughput. Inline data dedupe, whether performed at the source or target, is not recommended for large data volumes for a number of reasons.
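
The hash-index lookup described above can be illustrated with a minimal sketch. It is not any vendor's implementation: the 4 KB fixed segment size, the choice of SHA-256, and the names DedupeStore, write_backup, and restore are all assumptions made for illustration, and an in-memory list stands in for the disk.

```python
import hashlib

CHUNK_SIZE = 4096  # assumed fixed segment size; products may also use variable segments


class DedupeStore:
    """Toy in-memory model of a hash index plus a store of unique chunks."""

    def __init__(self):
        self.index = {}   # hash value -> position of the unique chunk in self.chunks
        self.chunks = []  # stands in for the disk; holds unique chunks only

    def write_backup(self, data: bytes) -> list:
        """Ingest one backup stream and return its recipe (a list of chunk pointers)."""
        recipe = []
        for offset in range(0, len(data), CHUNK_SIZE):
            segment = data[offset:offset + CHUNK_SIZE]
            digest = hashlib.sha256(segment).digest()
            pointer = self.index.get(digest)
            if pointer is None:
                # Hash is not in the index: the segment is new, so it is written.
                pointer = len(self.chunks)
                self.chunks.append(segment)
                self.index[digest] = pointer
            # New or duplicate, the backup itself records only a pointer.
            recipe.append(pointer)
        return recipe

    def restore(self, recipe) -> bytes:
        """Rebuild a backup stream by following its pointers."""
        return b"".join(self.chunks[p] for p in recipe)


store = DedupeStore()

# First backup: ten distinct 4 KB segments.
first = b"".join(bytes([i]) * CHUNK_SIZE for i in range(10))
recipe1 = store.write_backup(first)

# Second backup: identical except for one changed segment.
second = first[:3 * CHUNK_SIZE] + bytes([99]) * CHUNK_SIZE + first[4 * CHUNK_SIZE:]
recipe2 = store.write_backup(second)

print(len(store.chunks))  # unique chunks stored after both backups
```

In this sketch the second backup adds a single chunk to the store, because nine of its ten segments already have entries in the index. Every segment still costs one index lookup, which is the step the text identifies as a performance constraint for hash-based systems.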
Deduplication
SEPATON, Inc.







