Big Data Demands Big Data Backup
As I mentioned in my last blog, “Warning: Big Data will Break Your Backup,” we have all gotten used to hearing about huge data growth and its impact on data centers. As daunting as this data growth has been, it doesn’t compare to the tsunami being generated in big data environments. Big data is measured in petabytes and growing by orders of magnitude annually. Simply adding more traditional data protection technologies (tape and multiple siloed systems) is not sufficient. They are just too slow and too labor-intensive to be feasible and cost-effective in these environments.
Big Data Environments Need a New Approach
Just as big data primary storage is efficient, fast, and scalable, big data secondary storage and data protection need to be efficient, fast, and scalable too. Let’s start with efficiency.
With big data protection come big costs. The days of backing up and storing everything (just in case) are over. Cost reduction, labor savings, and system efficiency are paramount if organizations are to protect massively growing amounts of data while meeting backup windows and business objectives. Automatic, tiered data protection that includes continuous data protection (CDP), snapshots, disk-to-disk (D2D) backup, tape, remote replication, and cloud is a must-have. Data protection technologies designed for big data environments need to be capable of automatically identifying and moving low-priority data to the lowest-cost recovery tier without administrator involvement.
Tiered Storage Becomes Essential
A tiered recovery model enables data managers to balance cost against recovery time, according to the recovery SLA of each dataset. While low-priority data may have slower restore times, all restores use the same mechanism, regardless of tier.
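To make the cost-versus-recovery-time trade-off concrete, here is a minimal sketch of how a policy engine might pick a tier for a dataset. Everything in it is an assumption for illustration: the `Tier` class, the `choose_tier` function, and the cost and restore-time figures are hypothetical and are not taken from any product or benchmark.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    cost_per_tb: float    # hypothetical relative cost of keeping 1 TB
    restore_hours: float  # hypothetical typical time to restore


# Illustrative numbers only; real tiers and figures will differ.
TIERS = [
    Tier("cdp", 100.0, 0.1),
    Tier("snapshot", 60.0, 0.5),
    Tier("d2d", 30.0, 4.0),
    Tier("cloud", 10.0, 24.0),
    Tier("tape", 5.0, 72.0),
]


def choose_tier(rto_hours: float, tiers=TIERS) -> Tier:
    """Return the cheapest tier whose restore time meets the dataset's SLA."""
    eligible = [t for t in tiers if t.restore_hours <= rto_hours]
    if not eligible:
        # No tier is fast enough: fall back to the fastest one available.
        return min(tiers, key=lambda t: t.restore_hours)
    return min(eligible, key=lambda t: t.cost_per_tb)


if __name__ == "__main__":
    for rto in (1.0, 4.0, 100.0):
        print(f"SLA of {rto} hours -> {choose_tier(rto).name}")
```

In this sketch, a dataset with a tight recovery SLA lands on a fast, expensive tier, while data that can wait days drops to the cheapest one, with no administrator making the call per dataset.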
Second, even with smart, policy-driven tiering, big data environments still need fast, scalable ingest performance to meet their data protection needs. Stacking up dozens of single-node, inline deduplication systems will lead to crushing data center sprawl and complexity. To meet data protection needs and backup windows, big data environments need massive single-system capacity coupled with scalable, multi-node deduplication that doesn’t slow data ingest or bog down restore times.
Compatibility with Existing Infrastructure and Reporting
Third, given the scale of big data environments, high-risk rip-and-replace solutions are out of the question. New data protection solutions need to coexist with and integrate seamlessly into existing environments without disruption or added complexity. Ideally, IT staff can manage the resulting environment, both new and existing IT resources, in one simple view that consolidates management and reporting for the entire infrastructure.
Database Deduplication Efficiency
Big data environments often contain a large volume of duplicate data in segments that are too small for many of today’s deduplication technologies to detect. New deduplication technologies optimized for big data are required to minimize costs and footprint regardless of where the data is retained.
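The effect of segment size can be illustrated with a small sketch. It assumes simple fixed-size chunking with SHA-256 fingerprints, which is a simplification rather than a description of how any particular product works, and it uses synthetic records (a unique 16-byte key followed by a repeated 48-byte payload) as a stand-in for database rows. The names `make_records` and `dedup_ratio` are hypothetical.

```python
import hashlib


def make_records(count: int) -> bytes:
    """Build synthetic 64-byte records: a unique key plus a repeated payload."""
    payload = bytes(range(48))  # identical in every record
    return b"".join(i.to_bytes(16, "big") + payload for i in range(count))


def dedup_ratio(data: bytes, chunk_size: int) -> float:
    """Logical size divided by the bytes stored after fixed-size chunk dedup."""
    seen = set()
    stored = 0
    for start in range(0, len(data), chunk_size):
        chunk = data[start:start + chunk_size]
        fingerprint = hashlib.sha256(chunk).digest()
        if fingerprint not in seen:
            # First time this chunk appears, so it has to be stored.
            seen.add(fingerprint)
            stored += len(chunk)
    return len(data) / stored


data = make_records(1000)

if __name__ == "__main__":
    for size in (64, 16):
        print(f"{size}-byte chunks: ratio {dedup_ratio(data, size):.2f}")
```

With 64-byte chunks every chunk contains a unique key, so nothing is deduplicated. With 16-byte chunks the repeated payload is stored only once, which is the kind of small-segment redundancy the paragraph above describes.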
The bottom line: traditional data protection cannot handle demanding big data environments. A new approach is needed that delivers:
• Fast, deterministic ingest/egress performance
• Intelligent, automated tiering
• Enterprise-wide management control and reporting
• Scalable deduplication designed specifically for big data database environments
In an upcoming blog, I will cover the big data topic in more depth with a discussion of what is needed in a big backup appliance.