On Mon, 25 Apr 2011, Mark Woodward wrote:

> On 04/24/2011 10:52 PM, Edward Ned Harvey wrote:
>>> From: Mark Woodward [mailto:markw-FJ05HQ0HCKaWd6l5hS35sQ at public.gmane.org]
>>>
>>> You know, I've read the same math and I've worked it out myself. I agree
>>> it sounds so astronomical as to be unrealistic to even imagine it, but no
>>> matter how astronomical the odds, someone usually wins the lottery.
>>>
>>> I'm just trying to assure myself that there isn't some probability
>>> calculation missing. I guess my gut is telling me this is too easy.
>>> We're missing something.
>>
>> See - you're overlooking my first point. The cost of enabling verification
>> is so darn near zero that you should simply enable verification for the
>> sake of not having to justify your decision to anybody (including yourself,
>> if you're not feeling comfortable).
>
> Actually, I'm using ZFS as an example. I'm doing something different, but
> the theory is the same, and yes, I'm still using SHA-256.
>
>> Actually, there are two assumptions being made:
>>
>> (1) We're assuming SHA-256 is an ideally distributed hash function. Nobody
>> can prove that it's not - so we assume it is - but nobody can prove that it
>> is, either. If the hash distribution turns out to be imbalanced, for example
>> if certain hashes are more probable than others, then that would increase
>> the probability of hash collision.
>
> True.
>
>> (2) We're assuming the data in question is not being maliciously formed for
>> the purpose of causing a hash collision. I think this is a safe
>> assumption, because in the event of a collision, you would have two
>> different pieces of data that are assumed to be identical, and therefore
>> one of them is thrown away... And personally, I can accept the consequence
>> of discarding data if someone is intentionally trying to break my
>> filesystem maliciously.
>
> I'm not sure this point is important. I trust that it is pretty darn
> hard to create a SHA-256 collision. I would almost believe that it would
> be more likely that blocks collided by random chance than malice.
>
>>> Besides, personally, I'm looking at 16K blocks, which increases the
>>> probability a bit.
>>
>> You seem to have that backward. First of all, the default block size is
>> (up to) 128K... and the smaller the block size of the filesystem, the
>> higher the number of blocks, and therefore the higher the probability of
>> collision.
>
> This is one of those things that makes my brain hurt. If I am
> representing more data with a fixed-size number, i.e. a 4K block vs. a
> 16K block, that does, in fact, increase the probability of collision 4X,

Only for very small blocks. Once the block is larger than the hash, the
probability of a collision is independent of the block size.

Daniel Feenberg

> however, it does decrease the total number of blocks by about 4X as well.
>
>> If, for example, you had 1TB of data broken up into 1M blocks, then you
>> would have a total of 2^20 blocks. But if you broke it up into 1K blocks,
>> then your block count would be 2^30. With a higher number of blocks being
>> hashed, you get a higher probability of hash collision.
>
> It comes down to absolute trust that the hashing algorithm works as
> expected and that the data is as randomly distributed as expected.
>
> I'm sort of old school, I guess. The mindset is not about probability;
> it is about absolutes. In data storage, it has always been about
> verifiability, and we conveniently address probability of failure as a
> different problem and address it differently. This methodology seems to
> merge the two. Statistically speaking, I think I'm looking for 100%
> assurance, and no such assurance has ever really existed.
>
> It's cool stuff. It is a completely different way of looking at storage.
>
> _______________________________________________
> Discuss mailing list
> Discuss-mNDKBlG2WHs at public.gmane.org
> http://lists.blu.org/mailman/listinfo/discuss
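The block-count argument above can be made concrete with the standard birthday-problem upper bound: for n uniformly random k-bit hashes, the chance that any two collide is at most n(n-1)/2^(k+1). A minimal sketch (the function name is mine, not from the thread), plugging in the 1TB / 1K-block figure of 2^30 blocks from the discussion:

```python
from math import log2

def collision_bound(n_blocks: int, hash_bits: int = 256) -> float:
    """Birthday-problem upper bound on the probability that any two of
    n_blocks uniformly random hash_bits-bit hashes collide:
    p <= n*(n-1) / 2^(k+1)."""
    return n_blocks * (n_blocks - 1) / 2 ** (hash_bits + 1)

# 1TB in 1K blocks -> 2^30 blocks, as in the thread
p = collision_bound(2 ** 30)
print(f"upper bound ~ 2^{log2(p):.0f}")  # on the order of 2^-197
```

This also shows why Harvey says smaller blocks raise the risk: going from 1M blocks (2^20 of them) to 1K blocks (2^30 of them) multiplies the bound by about 2^20, yet even then it stays around 2^-197, astronomically small unless the ideal-distribution assumption fails.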