Technology: Data Deduplication

– Is the elimination of redundant data
– Relies upon cryptographic hash functions to identify duplicate segments of data
– Also called intelligent compression or single-instance storage
– Sometimes used with compression & delta differencing

– Operates at file, block or bit level
– Each chunk of data is hashed with MD5/SHA-1
o Problem is hash collisions
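The hash-based scheme in the notes above can be sketched as a fixed-size block dedupe store. A minimal illustration, assuming SHA-1 per the notes; the store/index layout and `block_size` are illustrative choices, not any vendor's design:

```python
import hashlib

def dedupe_store(data: bytes, block_size: int = 4096):
    """Split data into fixed-size blocks, keep one copy per unique
    SHA-1 digest, and return (index, store)."""
    store = {}   # digest -> block bytes (one copy per unique block)
    index = []   # ordered digests needed to reconstruct the stream
    for off in range(0, len(data), block_size):
        block = data[off:off + block_size]
        # 20-byte ID; collisions are astronomically unlikely but not
        # impossible -- the problem noted above
        digest = hashlib.sha1(block).digest()
        store.setdefault(digest, block)
        index.append(digest)
    return index, store

def restore(index, store) -> bytes:
    """Rebuild the original stream by following the digest index."""
    return b"".join(store[d] for d in index)
```

Two identical blocks collapse to a single stored copy plus a pointer (the digest) in the index, which is where the space savings come from.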
– Deduplication ratio depends upon
o Type of data
o Change rate of data
o Amount of redundant data
o Type of backup performed (full, incremental, differential)
o Retention length of the archived data
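As a rough illustration of how two of the factors above (change rate and retention) drive the ratio, here is a back-of-the-envelope model. The formula is an assumption for illustration, not from the notes:

```python
def dedupe_ratio(full_size, change_rate, retained_fulls):
    """Estimate the dedupe ratio when retaining several full backups,
    where change_rate is the fraction of data changed between fulls.
    Illustrative model: first full stored whole, later fulls add only
    their changed fraction."""
    logical = full_size * retained_fulls
    stored = full_size + full_size * change_rate * (retained_fulls - 1)
    return logical / stored

# e.g. 1 TB fulls, 2% change between fulls, 30 retained fulls -> about 19:1
print(round(dedupe_ratio(1.0, 0.02, 30), 1))
```

Lower change rates and longer retention both push the ratio up, matching the factors listed above.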
– Types
o Target-based
 Dedupe happens in the data path between the backup source & the backup target
o Source-based
 Backup software performs the dedupe before sending it to the backup device
– Performance considerations
o Write (ingest) speed
o Restore speed
– Other notes
o Variable-length block data deduplication is best (i.e., best opportunity to only save a pointer)
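The variable-length point above can be illustrated with a toy content-defined chunker: boundaries are cut where a rolling checksum over the last few bytes matches a bit pattern, so inserting data only shifts nearby boundaries instead of re-aligning every block. All parameters here are assumed for illustration; real systems typically use Rabin fingerprints:

```python
def cdc_chunks(data: bytes, window: int = 16, mask: int = 0x3FF,
               min_size: int = 256, max_size: int = 8192):
    """Variable-length (content-defined) chunking sketch: cut a chunk
    boundary when a rolling sum over the last `window` bytes matches a
    target bit pattern, bounded by min/max chunk sizes."""
    chunks, start, rolling = [], 0, 0
    for i, b in enumerate(data):
        rolling += b
        if i >= window:
            rolling -= data[i - window]   # slide the window forward
        size = i - start + 1
        if (size >= min_size and (rolling & mask) == 0) or size >= max_size:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])       # trailing partial chunk
    return chunks
```

Because boundaries depend on content rather than fixed offsets, a shifted stream still produces mostly identical chunks, maximizing the chance of only saving a pointer.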

– Deduping to tape could make DR more burdensome

– Cost
– Reduction ratios
– Deployment
– Ease of use
– Impact upon backup/recovery
– Technical questions
o Where is agent installed?
o How does dedupe appliance appear to backup software?
o Is there granularity with any particular app or OS?
o Special hardware, or is it part of the backup solution?
o Is dedupe post-processing (after data lands on disk) or in-line (as data is written)?
o How does dedupe fit into storage needs?

– Data Domain
o Consolidated storage tier for backup, nearline & archive data
o Technology
 Data Invulnerability Architecture
 Stream-Informed Segment Layout (SISL) scaling architecture
• Identifies duplicate segments in RAM before data is stored to disk
 Replication technology
• Transfers only deduped & compressed unique changes
 Global compression
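The replication idea above (transfer only deduped, unique changes) can be sketched as a target-side ID lookup: the source ships a segment only if the target does not already hold its hash. The store layout and exchange below are assumptions for illustration:

```python
import hashlib

def replicate(source_segments, target_store):
    """Transfer only segments whose SHA-1 IDs the target store does
    not already hold; returns the number of payload bytes sent."""
    sent = 0
    for seg in source_segments:
        sid = hashlib.sha1(seg).digest()
        if sid not in target_store:   # target already has everything else
            target_store[sid] = seg   # ship only the unique segment
            sent += len(seg)
    return sent
```

A first replication sends each unique segment once; re-running it against the same target transfers nothing, which is why WAN replication of deduped stores is cheap.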
– Avamar
o Source data dedup at file & sub-file data segment level
o Stores only unique subfile variable length data segments
 Segments average 24 kB & are compressed to an average of 12 kB
• Then generates a unique 20-byte ID using SHA-1
o Up to 50x reduction in storage
 3.6TB down to 6.1GB
o Agents use 15% more CPU, but complete task 10x faster
o All backups stored as virtual full images
o Distributed indexing architecture based on unique IDs
o Runs integrity checks & checkpoints on stored data
o Tape out option
o Components
 Server/Appliance (data store) – storage or utility node
• Intel-based Red Hat Enterprise Linux
 Administrator software
 Client agent
 Replicator (between two Avamar installations)
o VMware
 Reductions
• 95% in data moved
• 90% in backup times
• 50% in disk impact
• 95% in NIC usage
• 80% in CPU usage
• 50% in memory usage
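The segment handling described in the Avamar notes above (variable-length segments compressed roughly 2:1, each given a unique 20-byte SHA-1 ID) might look like this in outline. Whether the ID is derived before or after compression is an assumption here; this sketch hashes the raw segment:

```python
import hashlib
import zlib

def make_segment(segment: bytes):
    """Compress a variable-length data segment and derive a 20-byte
    SHA-1 ID (hashing the raw segment here -- an assumption)."""
    sid = hashlib.sha1(segment).digest()   # unique 20-byte ID
    compressed = zlib.compress(segment)
    return sid, compressed

def open_segment(compressed: bytes) -> bytes:
    """Recover the original segment from its compressed form."""
    return zlib.decompress(compressed)
```

The ID lets the distributed index find duplicates, while compression covers the remaining within-segment redundancy.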

– Local
o Heavily changed files (e.g. SQL, Notes)
– Remote
o Locally stored picture files
o Restore from a disk failure

– Nightly Differential backup to disk
o 3.5 hours
– Backup mail to disk
o 7 hours
– Weekend backup
o 56 hours



Posted March 1, 2013 by terop
