[Discuss] ZFS vs. Btrfs

Mon Jan 7 19:39:45 EST 2013

> From: discuss-bounces+blu=nedharvey.com at blu.org [mailto:discuss-
> bounces+blu=nedharvey.com at blu.org] On Behalf Of Jerry Feldman
> 
> In my mind the important issue is resistance to drive failure. What
> happens in both ZFS and Btrfs in the case of a power failure.

In ZFS, data is written to disk in transaction groups (TXGs).  Some reserved blocks are used as a ring buffer to store the uberblock.  When a TXG is written to the pool, the uberblock is updated.  When the pool is mounted, the system looks in the reserved uberblock storage area, finds the entry with the highest transaction number and a matching checksum, and uses that entry as the latest fully flushed uberblock/TXG.  All transactions are therefore atomic, and the filesystem can't be left in an inconsistent state on disk (unless you have a failing CPU or memory or something like that calculating incorrect checksums).  So after a power outage or kernel panic, your filesystem is definitely consistent, and you can lose at most the last 5 seconds of async buffered writes that had not yet been flushed to disk before the crash.
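To make the uberblock selection concrete, here is a toy sketch in Python.  It is not the real ZFS on-disk format: the slot count, the entry layout (a TXG number plus a root pointer), the SHA-256 checksum and all function names are assumptions made up for illustration.  It only shows the idea of "highest transaction number with a matching checksum wins".

```python
import hashlib
import struct
from typing import List, Optional, Tuple

SLOT_COUNT = 8  # hypothetical ring size, not the real ZFS value


def make_entry(txg: int, root_ptr: int) -> bytes:
    """Serialize a toy uberblock: two 64-bit integers followed by a checksum."""
    payload = struct.pack("<QQ", txg, root_ptr)
    return payload + hashlib.sha256(payload).digest()


def parse_entry(raw: bytes) -> Optional[Tuple[int, int]]:
    """Return (txg, root_ptr), or None if the slot is empty or fails its checksum."""
    if len(raw) != 16 + 32:
        return None
    payload, digest = raw[:16], raw[16:]
    if hashlib.sha256(payload).digest() != digest:
        return None
    txg, root_ptr = struct.unpack("<QQ", payload)
    return txg, root_ptr


def write_uberblock(ring: List[bytes], txg: int, root_ptr: int) -> None:
    """Overwrite the oldest slot; the other slots are left untouched."""
    ring[txg % SLOT_COUNT] = make_entry(txg, root_ptr)


def find_active_uberblock(ring: List[bytes]) -> Optional[Tuple[int, int]]:
    """At mount time, pick the valid entry with the highest TXG number."""
    valid = [entry for entry in map(parse_entry, ring) if entry is not None]
    return max(valid, key=lambda entry: entry[0], default=None)
```

If the power fails halfway through writing the newest entry, that slot fails its checksum and the previous TXG's entry is selected instead, which is why the worst case is losing the unflushed writes rather than ending up with an inconsistent filesystem.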

It's slightly more complicated when you consider sync-mode writes.  Sync writes are immediately written to nonvolatile (NV) storage, which means ZIL blocks in-pool if you don't have a dedicated log device.  After sync writes hit the ZIL, they are treated as async writes and get buffered with all the other async writes.  At pool mount time, the system checks the ZIL for any unflushed transactions and, if necessary, flushes them to the pool.  This guarantees both filesystem consistency and POSIX behavior compliance: sync writes are preserved in NV storage and survive such a crash.
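The sync-write path can be sketched the same way.  This is a toy model, not how the ZIL is actually implemented: the ToyPool class, its fields and its method names are all hypothetical, and "stable storage" is just a Python dict.  It shows sync writes being logged immediately, everything being buffered until the next TXG flush, and the log being replayed at mount time after a crash.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class LogRecord:
    txg: int      # transaction group this write is waiting to be flushed in
    key: str
    value: bytes


@dataclass
class ToyPool:
    committed_txg: int = 0
    # Survives a crash: data as of the last flushed TXG, plus the intent log.
    data: Dict[str, bytes] = field(default_factory=dict)
    intent_log: List[LogRecord] = field(default_factory=list)
    # Lost in a crash: writes buffered in RAM for the next TXG.
    pending: Dict[str, bytes] = field(default_factory=dict)

    def write(self, key: str, value: bytes, sync: bool = False) -> None:
        if sync:
            # A sync write reaches stable storage before the call returns.
            self.intent_log.append(LogRecord(self.committed_txg + 1, key, value))
        # Sync or not, the write is then buffered like any async write.
        self.pending[key] = value

    def flush_txg(self) -> None:
        """Commit all buffered writes as one TXG; older log records are obsolete."""
        self.data.update(self.pending)
        self.pending.clear()
        self.committed_txg += 1
        self.intent_log = [r for r in self.intent_log if r.txg > self.committed_txg]

    def crash(self) -> None:
        """Power failure: everything held only in RAM is gone."""
        self.pending.clear()

    def mount(self) -> None:
        """Replay logged sync writes that never made it into a flushed TXG."""
        for rec in self.intent_log:
            if rec.txg > self.committed_txg:
                self.data[rec.key] = rec.value
        self.intent_log.clear()
```

In this model, a sync write issued just before crash() is still present after mount(), while an async write issued at the same moment is not.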

It is therefore possible, as you might expect, that sync writes will find their way into the filesystem consistently, while a few seconds' worth of unflushed async buffered writes might be lost.  But filesystem inconsistency isn't one of the possible end results.

In btrfs ... I have less detail ... but I know they write in transactions, and they do journaling (logging).  So the filesystem will be consistent.  They honor the POSIX behavior of sync writes to NV storage, so sync writes are guaranteed to be preserved.  And of course, async buffered writes that haven't been flushed yet are vulnerable to the crash.  So qualitatively you'll have similar reliability / crash guarantees.  But I know in ZFS (any modern version), the maximum length of time between TXG flushes is 5 sec...  I don't know if btrfs has any similar time limit, and if it does, I don't know what the value is.