ZFS and block deduplication

David Rosenstrauch darose-prQxUZoa2zOsTnJN9+BGXg at public.gmane.org
Fri Apr 22 11:53:23 EDT 2011


On 04/22/2011 11:41 AM, Mark Woodward wrote:
> I have been trying to convince myself that the SHA2/256 hash is
> sufficient to identify blocks on a file system. Is anyone familiar with
> this?
>
> The theory is that you take a hash of a block on disk, and the
> hash, which is much smaller than the actual block, is unique enough that
> the probability of any two distinct blocks producing the same hash is
> less than the probability of hardware failure.

> Given a small enough block size and a small enough data set, I can
> almost see it as safe enough for backups, but I certainly wouldn't put
> mission-critical data on it. Would you? Tell me how I'm flat out wrong.
> I need to hear it.
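For what it's worth, the "less than hardware failure" claim is just birthday-bound arithmetic. Here's a quick back-of-the-envelope sketch (my own illustration, not anything from the ZFS code; the function name and the petabyte example are mine):

```python
import math

def collision_probability(n_blocks: float, hash_bits: int = 256) -> float:
    """Birthday-bound approximation: p ~= 1 - exp(-n^2 / 2^(b+1)).

    Uses expm1 so the tiny result isn't lost to floating-point rounding.
    """
    exponent = -(n_blocks ** 2) / (2 ** (hash_bits + 1))
    return -math.expm1(exponent)

# A petabyte of 4 KiB blocks: 2^50 / 2^12 = 2^38 blocks.
p = collision_probability(2 ** 38)
print(p)  # on the order of 2^-181 -- far below any disk's error rate
```

Compare that with an undetected-bit-error rate of roughly 10^-15 to 10^-18 per bit on commodity disks, and the hash collision is nowhere near the dominant risk.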

If you read up on the rsync algorithm 
(http://cs.anu.edu.au/techreports/1996/TR-CS-96-05.html), the author uses 
a combination of two different checksums to determine block uniqueness. 
And, IIRC, even then he still performs a final check to make sure the 
copied data is correct (and copies it again if not).
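The two-checksum idea looks roughly like this (a simplified sketch loosely modeled on the rsync paper: an Adler-style weak checksum screens candidates cheaply, and a strong hash confirms a match; function names are mine, and the real rsync weak checksum is rolling, which this isn't):

```python
import hashlib

def weak_checksum(block: bytes) -> int:
    """Adler-style weak checksum: cheap to compute, prone to collisions."""
    a = sum(block) & 0xFFFF
    b = sum((len(block) - i) * byte for i, byte in enumerate(block)) & 0xFFFF
    return (b << 16) | a

def strong_checksum(block: bytes) -> bytes:
    """Strong hash, only computed when the weak checksums already agree."""
    return hashlib.md5(block).digest()

def blocks_match(src: bytes, dst: bytes) -> bool:
    """Screen with the weak checksum; confirm with the strong hash."""
    if weak_checksum(src) != weak_checksum(dst):
        return False
    return strong_checksum(src) == strong_checksum(dst)

def transfer_ok(src_file: bytes, dst_file: bytes) -> bool:
    """Final whole-file verification, analogous to rsync's end-of-transfer check."""
    return hashlib.md5(src_file).digest() == hashlib.md5(dst_file).digest()
```

The layering is the point: the weak checksum makes scanning fast, the strong hash makes a block match trustworthy, and the final whole-file check catches anything that slipped through both.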

DR

More information about the Discuss mailing list