BLU Discuss list archive

ZFS and block deduplication

Subject: ZFS and block deduplication
From: markw-FJ05HQ0HCKaWd6l5hS35sQ at public.gmane.org (Mark Woodward)
Date: Fri, 22 Apr 2011 11:41:43 -0400

I have been trying to convince myself that a SHA-256 hash is 
sufficient to identify blocks on a file system. Is anyone familiar with 
this?

The theory is that you take a hash of each block on disk, and the 
hash, which is much smaller than the block itself, is unique enough that 
the probability of any two different blocks producing the same hash is 
actually less than the probability of a hardware failure.
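
To make the theory concrete, here is a minimal sketch of hash-keyed 
block storage in Python. It is only an illustration of the idea, not 
how ZFS actually implements it; the DedupStore class and its method 
names are hypothetical.

```python
import hashlib


class DedupStore:
    """Toy block store that keeps one copy of each block per SHA-256 digest."""

    def __init__(self):
        # Maps 32-byte digest -> block contents.
        self.blocks = {}

    def put(self, block: bytes) -> bytes:
        """Store a block and return its digest, which acts as the block's ID."""
        key = hashlib.sha256(block).digest()
        # If the digest is already present, the new block is assumed to be
        # identical and is NOT stored again. This is exactly the step that
        # trusts the hash.
        self.blocks.setdefault(key, block)
        return key

    def get(self, key: bytes) -> bytes:
        """Return the block stored under the given digest."""
        return self.blocks[key]


store = DedupStore()
block_a = b"A" * 32768
block_b = b"B" * 32768
key_1 = store.put(block_a)
key_2 = store.put(block_a)  # duplicate content, same digest
key_3 = store.put(block_b)
```

After the three put() calls the store holds two blocks, not three. A 
more cautious design could compare the bytes on a digest match before 
discarding the new block, at the cost of an extra read.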

Now, I know basic statistics well enough to not play the lottery, but 
I'm not sure I can get my head around it. On a completely logical level, 
assume that you have a block size of 32K and a hash size of 32 chars. 
Then there are 1000 (1024 if we are talking binary 32K) potential 
duplicate blocks per single hash. Right? For every unique block (by 
hash) we have a potential of 1000 collisions.
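
As a sanity check on that count, here is a rough sketch in Python, 
assuming 32 KiB blocks and a 256-bit (32-byte) digest; the constant 
names are my own. Dividing the sizes gives the 1024 figure, but counting 
distinct bit patterns gives a far larger number of possible blocks per 
hash value.

```python
BLOCK_BYTES = 32 * 1024  # assumed block size: 32 KiB
HASH_BYTES = 32          # assumed digest size: 32 bytes = 256 bits

block_bits = BLOCK_BYTES * 8
hash_bits = HASH_BYTES * 8

# Ratio of the sizes: how many digests fit in one block.
size_ratio = BLOCK_BYTES // HASH_BYTES

# Ratio of the counts: there are 2**block_bits possible blocks and
# 2**hash_bits possible digests, so on average each digest corresponds to
# 2**(block_bits - hash_bits) distinct blocks. Only the exponent is kept
# here because the number itself has tens of thousands of digits.
log2_blocks_per_hash = block_bits - hash_bits

print("size ratio:", size_ratio)
print("average blocks per hash: 2 **", log2_blocks_per_hash)
```

So collisions certainly exist in principle. The question is how likely 
one is among the blocks a real pool actually stores.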

Also, looking at the "birthday paradox," since every block is equally 
likely as every other block (in reality we know this is not 100% true), 
aren't the creators' stated probability calculations much weaker than 
assumed?
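
For the birthday question, the usual approximation is that n items 
hashed uniformly into 2^b values collide with probability of about 
n(n-1)/2 divided by 2^b. Here is a sketch of that estimate in Python. It 
assumes the hash output is uniformly distributed, and the function name 
and the 1 PiB pool size are only illustrative.

```python
import math


def collision_log2_prob(num_blocks: int, hash_bits: int = 256) -> float:
    """Return log2 of the birthday-bound estimate n*(n-1)/2 / 2**hash_bits.

    The estimate is only meaningful while it is far below 1, which holds
    for any realistic block count with a 256-bit hash.
    """
    if num_blocks < 2:
        raise ValueError("need at least two blocks for a collision")
    pairs = num_blocks * (num_blocks - 1) // 2  # number of distinct block pairs
    return math.log2(pairs) - hash_bits


# Hypothetical pool: 1 PiB of unique 32 KiB blocks = 2**50 / 2**15 blocks.
blocks_in_pool = 2**50 // (32 * 1024)
exponent = collision_log2_prob(blocks_in_pool)
print(f"{blocks_in_pool} blocks: collision probability is about 2 ** {exponent:.1f}")
```

The result is an exponent, which can be set against whatever 
undetected-error rate one is willing to assume for the hardware.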

I come from the old school where "God does not play dice," especially 
with storage.

Given a small enough block size with a small enough set size, I can 
almost see it as safe enough for backups, but I certainly wouldn't put 
mission-critical data on it. Would you? Tell me how I'm flat-out wrong. 
I need to hear it.

Boston Linux & Unix / webmaster@blu.org