BLU Discuss list archive


ZFS and block deduplication

Subject: ZFS and block deduplication
From: markw-FJ05HQ0HCKaWd6l5hS35sQ at public.gmane.org (Mark Woodward)
Date: Mon, 25 Apr 2011 09:23:16 -0400
In-reply-to: <000001cc02f3$e206d9b0$a6148d10$@nedharvey.com>
References: <4DB1A1B7.2040304@mohawksoft.com> <000101cc01b2$52b14ee0$f813eca0$@nedharvey.com> <4DB322E2.9020909@mohawksoft.com> <000001cc02f3$e206d9b0$a6148d10$@nedharvey.com>

On 04/24/2011 10:52 PM, Edward Ned Harvey wrote:
>> From: Mark Woodward [mailto:markw-FJ05HQ0HCKaWd6l5hS35sQ at public.gmane.org]
>>
>> You know, I've read the same math and I've worked it out myself. I agree it
>> sounds so astronomical as to be unrealistic to even imagine it, but no matter
>> how astronomical the odds, someone usually wins the lottery.
>>
>> I'm just trying to assure myself that there isn't some probability calculation
>> missing. I guess my gut is telling me this is too easy.
>> We're missing something.
> See - You're overlooking my first point.  The cost of enabling verification
> is so darn near zero, that you should simply enable verification for the
> sake of not having to justify your decision to anybody (including yourself,
> if you're not feeling comfortable.)
Actually, I'm using ZFS as an example. I'm doing something different, but
the theory is the same, and yes, I'm still using SHA256.
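
To make the "verification" idea concrete, here is a minimal sketch of a
hash-keyed block store, assuming an in-memory dict stands in for the
on-disk dedup table. The names DedupStore, HashCollision and the
pluggable digest parameter are hypothetical and exist only for
illustration; this is not how ZFS implements it. With verify on, a hash
match is followed by a byte-for-byte compare before the block is treated
as a duplicate.

```python
import hashlib


class HashCollision(Exception):
    """Raised when two different blocks produce the same digest."""


def sha256_digest(block: bytes) -> bytes:
    return hashlib.sha256(block).digest()


class DedupStore:
    """Toy block store keyed by digest (hypothetical, for illustration)."""

    def __init__(self, verify=True, digest=sha256_digest):
        self.verify = verify
        self.digest = digest      # pluggable so a weak hash can be tested
        self.blocks = {}          # digest -> stored block contents
        self.dedup_hits = 0

    def write(self, block: bytes) -> bytes:
        key = self.digest(block)
        existing = self.blocks.get(key)
        if existing is None:
            # First time this digest is seen: store the block.
            self.blocks[key] = block
            return key
        if self.verify and existing != block:
            # Same digest, different bytes: a real collision.
            raise HashCollision("digest matched but contents differ")
        # Without verify, a matching digest is trusted blindly.
        self.dedup_hits += 1
        return key

    def read(self, key: bytes) -> bytes:
        return self.blocks[key]
```

Without verify, a colliding write would silently hand back the older
block on read. The extra cost of verify in this sketch is one compare
per duplicate write; in a real system it would also mean reading the
existing block back from disk.
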
> Actually, there are two assumptions being made:
> (1) We're assuming sha256 is an ideally distributed hash function.  Nobody
> can prove that it's not - so we assume it is - but nobody can prove that it
> is either.  If the hash distribution turns out to be imbalanced, for example
> if there's a higher probability of certain hashes than other hashes...  Then
> that would increase the probability of hash collision.
True.
> (2) We're assuming the data in question is not being maliciously formed for
> the purposes of causing a hash collision.  I think this is a safe
> assumption, because in the event of a collision, you would have two
> different pieces of data that are assumed to be identical and therefore one
> of them is thrown away...  And personally I can accept the consequence of
> discarding data if someone's intentionally trying to break my filesystem
> maliciously.
I'm not sure this point is important. I trust that it is pretty darn
hard to create a SHA256 collision. I would almost believe that blocks
are more likely to collide by random chance than by malice.
>> Besides, personally, I'm looking at 16K blocks which increases the probability
>> a bit.
> You seem to have that backward - First of all the default block size is (up
> to) 128k...  and the smaller the blocksize of the filesystem, the higher the
> number of blocks and therefore the higher the probability of collision.
This is one of those things that make my brain hurt. If I represent
more data with a fixed-size number, e.g. a 16K block instead of a 4K
block, that does, in fact, increase the probability of collision 4x.
However, it also decreases the total number of blocks by about 4x.


> If for example you had 1Tb of data, broken up into 1M blocks, then you would
> have a total number of 2^20 blocks.  But if you broke it up into 1K blocks,
> then your block count would be 2^30.  With a higher number of blocks being
> hashed, you get a higher probability of hash collision.
It comes down to absolute trust that the hashing algorithm works as 
expected and that the data is as randomly distributed as expected.
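
The block-count arithmetic quoted above can be checked with a short
sketch. It assumes an ideal 256-bit hash and that every block is
distinct, and it uses the standard birthday upper bound
p <= n(n-1)/2 / 2^bits. The function names are made up for this
example.

```python
import math


def block_count(total_bytes: int, block_bytes: int) -> int:
    """Number of blocks needed to hold total_bytes (rounded up)."""
    return -(-total_bytes // block_bytes)


def log2_collision_bound(n_blocks: int, hash_bits: int = 256) -> float:
    """log2 of the birthday upper bound on any collision among n blocks.

    Assumes an ideal hash: each distinct pair collides with
    probability 2**-hash_bits.
    """
    if n_blocks < 2:
        return float("-inf")  # no pairs, so no collision is possible
    pairs = n_blocks * (n_blocks - 1) // 2
    return math.log2(pairs) - hash_bits


if __name__ == "__main__":
    one_tib = 2 ** 40
    for block_bytes in (2 ** 20, 2 ** 14, 2 ** 10):
        n = block_count(one_tib, block_bytes)
        print(f"{block_bytes:>8} byte blocks: n = 2^{n.bit_length() - 1}, "
              f"bound ~ 2^{log2_collision_bound(n):.0f}")
```

Under these assumptions, 1 TiB in 1 MiB blocks gives a bound of about
2^-217, and in 1 KiB blocks about 2^-197. In this model the per-pair
probability depends only on the digest width, so the block size matters
only through the number of blocks.
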

I'm sort of old school, I guess. The mindset is not about probability;
it is about absolutes. In data storage, it has always been about
verifiability, and we conveniently treat probability of failure as a
separate problem and address it differently. This methodology seems to
merge the two. Statistically speaking, I think I'm looking for 100%
assurance, and no such assurance has ever really existed.

It's cool stuff. It is a completely different way of looking at storage.

Boston Linux & Unix / webmaster@blu.org