
BLU Discuss list archive



[Discuss] On-site backups revisited - rsnapshot vs. CrashPlan



Jerry asked:
> I'm not sure what you really mean by sync and rotate in rsnapshot context.

Some of the documentation recommends a directive called "sync_first".  If you
set it to 1, it overrides the default behavior: you must then invoke
"rsnapshot sync" before "rsnapshot hourly".  The point of doing this is that
you can catch any sync errors and either handle them in your script, or re-run
the sync manually after fixing the problem, before the rotation happens.  It's
an improvement, but not well explained.
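
To make that concrete, here's a minimal wrapper along those lines (a sketch of
my own, assuming sync_first 1 is already set in rsnapshot.conf):

    #!/usr/bin/env python3
    # Minimal sketch: run "rsnapshot sync" first, and rotate only if the
    # sync succeeded.  Assumes sync_first 1 is set in rsnapshot.conf.
    import subprocess
    import sys

    def run(args):
        print("running:", " ".join(args))
        return subprocess.run(args).returncode

    if run(["rsnapshot", "sync"]) != 0:
        # Fix whatever broke, re-run the sync by hand, then rotate.
        sys.exit("rsnapshot sync reported a problem; not rotating")

    run(["rsnapshot", "hourly"])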

> You could keep [checksums]
> in a flat file at the same level where your archive is, so after
> hourly.0 is complete, set up an hourly.0.checksum.

Rich Pieri <richard.pieri at gmail.com> echoed:
> This is a better idea than storing the checksums in a database. It
> ensures that any given version of a file is always associated with the
> correct checksum list.
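
For reference, generating such a per-snapshot checksum file is only a few
lines of script.  A sketch, using Jerry's hourly.0.checksum naming:

    #!/usr/bin/env python3
    # Sketch: write a "<sha256>  <relative path>" line for every regular
    # file in a completed snapshot, e.g. hourly.0 -> hourly.0.checksum.
    import hashlib
    import os
    import sys

    snap = sys.argv[1].rstrip("/")              # e.g. /backups/hourly.0
    with open(snap + ".checksum", "w") as out:
        for dirpath, _dirs, files in os.walk(snap):
            for name in files:
                path = os.path.join(dirpath, name)
                if not os.path.isfile(path):    # skip broken symlinks
                    continue
                h = hashlib.sha256()
                with open(path, "rb") as f:
                    for chunk in iter(lambda: f.read(1 << 20), b""):
                        h.update(chunk)
                out.write("%s  %s\n" % (h.hexdigest(),
                                        os.path.relpath(path, snap)))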

I think there are a couple of advantages to keeping backup metadata in a
database table:  it's reachable from everywhere, which makes it easier to
write integrity-checker scripts (especially against offline backups), and you
can optimize the checksum process more easily (only generate checksums for new
files that aren't yet in the metadata store).  I'm also thinking long-term: 
with a really long retention of this data, say 20 years, I can know for
certain that I still have the same file I started with.  Checksums kept in the
same place as the backups can't protect against accidental rollover of the
backups themselves; with the metadata held separately, I can write scripts
that block me from such accidental rollovers.

For what it's worth:  creating the database schema and insertion script was
about 3 hours of work, which I've already done.  I'm amazed at how many times
the backup wheel has been reinvented without some of these fundamental
capabilities.  If I get ambitious, I'll package up my efforts and post to
GitHub Yet Another Backup Utility Whose Name I'll Have to Dream Up.
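
If it helps to picture it, the core of the idea looks something like this (a
stripped-down sketch, not my actual schema; the table and column names are
made up for the example):

    #!/usr/bin/env python3
    # Illustrative metadata store: one row per (saveset, path), inserted
    # only when missing, so re-runs skip files already checksummed.
    import sqlite3

    db = sqlite3.connect("backup-meta.db")
    db.execute("""CREATE TABLE IF NOT EXISTS file_checksum (
                      saveset    TEXT NOT NULL,   -- e.g. 'hourly.0'
                      relpath    TEXT NOT NULL,   -- path within snapshot
                      sha256     TEXT NOT NULL,
                      first_seen TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
                      PRIMARY KEY (saveset, relpath))""")

    def already_known(saveset, relpath):
        # Lets the checksum pass skip files the database has seen.
        return db.execute("SELECT 1 FROM file_checksum"
                          " WHERE saveset=? AND relpath=?",
                          (saveset, relpath)).fetchone() is not None

    def record(saveset, relpath, digest):
        db.execute("INSERT OR IGNORE INTO file_checksum"
                   " (saveset, relpath, sha256) VALUES (?, ?, ?)",
                   (saveset, relpath, digest))
        db.commit()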

Tom Metro wrote:
> So the scenario you are trying to protect from is one in which
> your source files are good, but your snapshot files get corrupt,
> while maintaining original size and timestamp, and thus are not
> overwritten by rsync?

As you indicated, I don't think I want rsnapshot to verify checksums on every
incremental pass, and I also don't want to burden the central backup server. 
So my thought is to write scripts that separate the "checksum my archive" and
"verify archive against saved checksums" processes from the "compare source
with archive checksums" process.  This is modeled on my understanding of how
CrashPlan "protects" against corruption.  Their "verify archive against saved
checksums" feature is called Compact; by default it runs every 28 days, walks
the archive, and /deletes/ any files which fail verification -- they call this
"self-healing" (by which they mean the software stops trusting a corrupted
file so it can be backed up again, but meanwhile that file is exposed to
loss).
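
My version of the "verify archive against saved checksums" pass would only
report failures, never delete.  A rough sketch, reusing the illustrative
table from above:

    #!/usr/bin/env python3
    # Re-hash each archived file and *report* mismatches -- report only,
    # no Compact-style deletion.  Uses the illustrative schema above.
    import hashlib
    import os
    import sqlite3
    import sys

    snap_root = sys.argv[1].rstrip("/")         # e.g. /backups/weekly.3
    saveset = os.path.basename(snap_root)

    db = sqlite3.connect("backup-meta.db")
    bad = 0
    for relpath, expected in db.execute(
            "SELECT relpath, sha256 FROM file_checksum WHERE saveset = ?",
            (saveset,)):
        h = hashlib.sha256()
        with open(os.path.join(snap_root, relpath), "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        if h.hexdigest() != expected:
            bad += 1
            print("CORRUPT:", relpath)
    sys.exit(1 if bad else 0)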

With 8TB of data and counting, spread across both near-line and off-line
storage, I need efficient mechanisms for doing these three things at varying
intervals.  That's probably the main reason I'm going with home-brew rather
than anything I could find off the shelf.

Someone mentioned git-annex, which looks interesting.  I wonder how well it
would scale to an archive of a million files and 50+ savesets, which is what
I'll likely be seeing at future employers.

-rich




