
BLU Discuss list archive



ZFS and block deduplication



> From: Mark Woodward [mailto:markw-FJ05HQ0HCKaWd6l5hS35sQ at public.gmane.org]
>
> You know, I've read the same math and I've worked it out myself. I agree it
> sounds so astronomical as to be unrealistic to even imagine it, but no matter
> how astronomical the odds, someone usually wins the lottery.
>
> I'm just trying to assure myself that there isn't some probability calculation
> missing. I guess my gut is telling me this is too easy.
> We're missing something.

See - you're overlooking my first point.  The cost of enabling verification
is so darn near zero that you should simply enable it, for the sake of not
having to justify your decision to anybody (including yourself, if you're
not feeling comfortable).
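
For what it's worth, flipping it on should be a one-liner.  If I have the
syntax right, it's something like this ("tank" is just a placeholder
dataset name):

    zfs set dedup=sha256,verify tank

or simply dedup=verify, which (if I remember right) amounts to the same
thing, since sha256 is the dedup checksum.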

Actually, there are two assumptions being made:
(1) We're assuming sha256 is an ideally distributed hash function.  Nobody
can prove that it's not - so we assume it is - but nobody can prove that it
is, either.  If the hash distribution turns out to be imbalanced - for
example, if certain hash values occur with higher probability than others -
then that would increase the probability of a hash collision.
(2) We're assuming the data in question is not being maliciously crafted for
the purpose of causing a hash collision.  I think this is a safe assumption,
because in the event of a collision, you would have two different pieces of
data that are assumed to be identical, and therefore one of them is thrown
away...  And personally, I can accept the consequence of discarding data if
someone is intentionally and maliciously trying to break my filesystem.


> Besides, personally, I'm looking at 16K blocks which increases the
> probability a bit.

You seem to have that backward.  First of all, the default block size is (up
to) 128 KB...  and the smaller the block size of the filesystem, the higher
the number of blocks, and therefore the higher the probability of collision.

If, for example, you had 1 TB of data broken up into 1 MB blocks, you would
have a total of 2^20 blocks.  But if you broke it up into 1 KB blocks, your
block count would be 2^30.  With a higher number of blocks being hashed, you
get a higher probability of hash collision.
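
Just to put rough numbers on it, here's a quick back-of-the-envelope sketch
of my own (not anything from the ZFS docs), using the standard birthday-bound
approximation p ~= n*(n-1)/2^257 and assuming sha256 behaves like an ideal
hash, per assumption (1) above:

    import math

    HASH_BITS = 256                  # sha256 output size, in bits
    TB = 2 ** 40                     # 1 TB of data, in bytes

    def collision_probability(n_blocks, hash_bits=HASH_BITS):
        # Birthday-bound approximation: p ~= n*(n-1) / 2^(hash_bits+1)
        return n_blocks * (n_blocks - 1) / 2 ** (hash_bits + 1)

    for label, block_size in [("1 MB", 2 ** 20), ("128 KB", 2 ** 17),
                              ("16 KB", 2 ** 14), ("1 KB", 2 ** 10)]:
        n = TB // block_size         # number of blocks in 1 TB
        p = collision_probability(n)
        print("%6s blocks: n = 2^%d, p ~ 2^%.0f" % (label, math.log2(n), math.log2(p)))

Even with 1 KB blocks the bound comes out around 2^-197, which is the kind of
astronomical number we've been talking about; shrinking the block size raises
the probability, but not by enough to matter.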





