Boston Linux & Unix (BLU) Home | Calendar | Mail Lists | List Archives | Desktop SIG | Hardware Hacking SIG
Wiki | Flickr | PicasaWeb | Video | Maps & Directions | Installfests | Keysignings
Linux Cafe | Meeting Notes | Linux Links | Bling | About BLU

BLU Discuss list archive


[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

ZFS and block deduplication



> From: discuss-bounces-mNDKBlG2WHs at public.gmane.org [mailto:discuss-bounces-mNDKBlG2WHs at public.gmane.org] On Behalf
> Of Mark Woodward
> 
> I have been trying to convince myself that the SHA2/256 hash is
> sufficient to identify blocks on a file system. Is anyone familiar with
> this?

I am intimately familiar with this.  And on planet Earth, yes it is
sufficient.  However, the cost of explaining your decision to anyone who is
skeptical outweighs the cost of enabling verification.  So as long as
there's even a possibility that anybody might challenge your decision, you
should enable verification.  But if you're the boss and nobody will
second-guess you, then you can safely rely on just the hash.

If you'd like to read more about it, try this search...  Google for
zfs-discuss collision
http://www.google.com/search?q=zfs-discuss+collision&ie=utf-8&oe=utf-8&aq=t&;
rls=org.mozilla:en-US:official&client=firefox-a 
Mostly in the extremely long thread, "(Fletcher+Verification) versus
(Sha256+No Verification)" 

PS.  "Safely" is a relative term.  The probability of collision is non-zero,
but the probability is essentially zero relative to Human Extinction Events,
which means you can (relatively) safely rely on just the sha256 hash.

If you want to calculate the actual probability of a collision, assume a 4k
(2^12 bytes) block size (worst case) and every single block is precisely the
same size (which isn't realistic, but is worst case) and every single block
is unique (in which case why have you enabled dedup.  So again,
unrealistically evil-clown scenario worst case) and if your data pool size
is the largest in the world (again worst case) say ... 2PB (2^41 bytes)...
that would be physically impossible to hold the dedup tables using hardware
currently available in the world, but again...  Evil clown worst case for
accidental collision.  Then the number of blocks is 2^41 / 2^12 = 2^29
unique blocks.

The formula on wikipedia for the birthday problem is:
p(n;d) ~= 1-( (d-1)/d )^( 0.5*n*(n-1) )

In this case, 
n=2^29
d=2^256

Using bc to calculate the answer:
bc -l
 
n=2^29
d=2^256
scale=1024
1-e(   (  0.5*n*(n-1)*l((d-1)/d)  )   )
.0000000000000000000000000000000000000000000000000000000000012446030
I manually truncated here (precision goes out 1024 places).  This is
1.24E-60

Notice:  There are estimated 1E50 atoms in Earth.  So in the evil clown
worst case for sha256 collision, the probability for collision is about the
same as randomly selecting the same atom twice consecutively from 1000
Earths.

Note: I had to repeat the calculation many times in bc, setting a larger and
larger scale.  The default scale of 20, and even 64 and 70 and 80 were not
precise enough to produce a convergent answer around the -57th decimal
place.  So I just kept going larger, and in retrospect, anything over 100
would have been fine.  I wrote 1024 above, so who cares.






BLU is a member of BostonUserGroups
BLU is a member of BostonUserGroups
We also thank MIT for the use of their facilities.

Valid HTML 4.01! Valid CSS!



Boston Linux & Unix / webmaster@blu.org