ZFS and block deduplication

Mark Woodward markw-FJ05HQ0HCKaWd6l5hS35sQ at public.gmane.org
Fri Apr 22 11:41:43 EDT 2011


I have been trying to convince myself that the SHA2/256 hash is 
sufficient to identify blocks on a file system. Is anyone familiar with 
this?

The theory is that you take a hash of a block on disk, and that the hash, 
though much smaller than the actual block, is unique enough that the 
probability of any two different blocks producing the same hash is lower 
than the probability of hardware failure.
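
To make the idea concrete, here is a rough sketch of what I mean by 
hash-keyed dedup. It is only an illustration: the 32K block size, the 
file reading, and the in-memory dict are my own assumptions, not how ZFS 
actually stores its dedup table.

import hashlib

BLOCK_SIZE = 32 * 1024  # 32K blocks, purely illustrative

def dedup_blocks(path, store=None):
    """Read a file block by block, keeping each unique block once,
    keyed by its SHA-256 digest. Returns the block store plus the
    list of digests needed to reconstruct the file."""
    store = {} if store is None else store
    layout = []
    with open(path, "rb") as f:
        while True:
            block = f.read(BLOCK_SIZE)
            if not block:
                break
            digest = hashlib.sha256(block).hexdigest()
            # If the digest has been seen before, we assume the block is
            # a duplicate and keep only the reference -- which is exactly
            # the leap of faith I am asking about.
            store.setdefault(digest, block)
            layout.append(digest)
    return store, layout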

Now, I know basic statistics well enough not to play the lottery, but I'm 
not sure I can get my head around this. On a purely logical level, assume 
a block size of 32K and a hash size of 32 bytes: there are 1000 (1024 if 
we are talking binary 32K) potential duplicate blocks per single hash, 
right? For every unique block (by hash) we have the potential for 1000 
collisions.
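
Spelling out the numbers I am using (the 32K block and 256-bit hash are 
the same assumptions as above):

# Back-of-the-envelope numbers, assuming 32K blocks and a 256-bit hash.
BLOCK_BYTES = 32 * 1024
HASH_BITS = 256

# The size ratio I quoted above: the block is 1024 times larger than the hash.
size_ratio = BLOCK_BYTES // (HASH_BITS // 8)
print(size_ratio)                   # 1024

# Counting values rather than sizes: there are 2**(BLOCK_BYTES * 8) possible
# blocks but only 2**HASH_BITS possible digests, so by the pigeonhole
# principle each digest stands for about 2**261888 distinct possible blocks.
# Collisions certainly exist; the question is how likely we are to hit one.
blocks_per_digest_exponent = BLOCK_BYTES * 8 - HASH_BITS
print(blocks_per_digest_exponent)   # 261888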

Also, looking at the "birthday paradox," since every block is equally 
likely as every other block (in reality we know this is not 100% true), 
isn't the creator's stated probability calculations much weaker than 
assumed?
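
For reference, the standard birthday-bound approximation (this is the 
textbook formula, not anything taken from the ZFS documentation) puts the 
chance of at least one collision among n uniformly random blocks hashed 
into d = 2**256 possible digests at roughly n**2 / (2*d):

# Birthday-bound approximation: p ~= n**2 / (2*d), with d = 2**hash_bits.
def collision_probability(n_blocks, hash_bits=256):
    return n_blocks ** 2 / (2.0 * 2 ** hash_bits)

# Example: 2**48 blocks (about 8 exabytes of 32K blocks) against a 256-bit
# hash -- the block count here is just an illustrative assumption.
print(collision_probability(2 ** 48))   # roughly 3.4e-49

My question is whether that formula still holds up once the "every block 
is equally likely" assumption is dropped.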

I come from the old school where "God does not play dice", especially 
with storage.

Given a small enough block size and a small enough data set, I can almost 
see it as safe enough for backups, but I certainly wouldn't put 
mission-critical data on it. Would you? Tell me how I'm flat out wrong. 
I need to hear it.




