Boston Linux & UNIX was originally founded in 1994 as part of The Boston Computer Society. We meet on the third Wednesday of each month, online, via Jitsi Meet.

BLU Discuss list archive



admins worst nightmare...



On Tue, Mar 9, 2010 at 10:07 AM, Stephen Adler <adler-wRvlPVLobi1/31tCrMuHxg at public.gmane.org> wrote:
>...
>
> The backups are made by copying the original data off the Sun Blade
> system to backup system #1. I then run a nightly cron job which rsyncs
> the data from backup system #1 to backup systems #2 and #3. The problem
> is that backup system #1 has the corrupted file system, so once the
> files were placed on backup system #1 and corrupted, I ended up copying
> corrupted data onto backup systems #2 and #3.
>
>...
> terabyte drives running a software RAID 5 array. Also, I have the
> smartd tools running, doing nightly and weekly checks. With all that in
> place, there was no warning of errors on the file system, which makes
> me think there is a bug in ext3/md RAID 5, or the PCI eSATA controller
> card is mucked up. I still have to verify the memory, which is supposed
> to be ECC memory.

Don't forget that ECC just reduces the probability that bad data will
be read from memory; it doesn't eliminate it.

Also, you are assuming that the problem here is in system #1.  It
could very well be that the error occurred during the network transfer
when the data was copied from the original system to system #1.  I
can't remember who it was now, but someone from AT&T gave a talk at
BBLISA a few years back about always doing checksums of files every
time he moved or copied them between networked systems.  As I recall,
he found plenty of errors on 'working' systems that were handling
massive amounts of data.
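
For what it's worth, that practice is easy to script.  A minimal
sketch in Python (the script name and the choice of MD5 are just
placeholders, not anything the speaker prescribed): hash the file at
both ends of the copy and compare, reading in chunks so multi-gigabyte
files don't have to fit in memory.

    import hashlib
    import sys

    def file_digest(path, algo="md5", chunk_size=1024 * 1024):
        # Stream the file through the hash so large files never sit in RAM.
        h = hashlib.new(algo)
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        return h.hexdigest()

    if __name__ == "__main__":
        # Usage: python verify_copy.py /path/to/original /path/to/copy
        src, dst = sys.argv[1], sys.argv[2]
        src_sum, dst_sum = file_digest(src), file_digest(dst)
        print("source:", src_sum, src)
        print("copy:  ", dst_sum, dst)
        print("OK" if src_sum == dst_sum else "MISMATCH")

Run something like that after every hop (original -> #1, #1 -> #2,
#1 -> #3) and you catch the corruption at the hop where it happens,
instead of years later.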

>
> I tried to do my homework in setting up this backup system, and with all
> the redundancy I put in, I thought I didn't need the md5 checksum.
> Well... I've learned my lesson the hard way.
>
> So... the lesson learned...
>
> ALWAYS DO MD5 CHECK SUMS ON CRITICAL DATA DURING BACKUPS NO MATTER HOW
> LONG IT TAKES, BEFORE YOU DELETE THE ORIGINAL DATA.

And do it every time you transfer that data to a new location.  And
if you are really paranoid, modify your applications to checksum the
data as they read it off of disk for processing.  Of course, this will
require either modifying your data file format or adding an auxiliary
file in which you store the block(?)-level checksums.  It's kind of
like what ZFS does, but it extends the protection to transfers across
networks or even to storage on backup media.
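
Something along these lines, sketched in Python (the sidecar format,
block size, and function names are made up for illustration; this is
not what ZFS actually stores): the writer records one digest per
fixed-size block in an auxiliary file that travels with the data, and
readers re-verify each block as they consume it.

    import hashlib
    import json

    BLOCK_SIZE = 64 * 1024  # arbitrary block size for the example

    def write_sidecar(data_path, sums_path):
        # Record a digest for every fixed-size block of the data file.
        sums = []
        with open(data_path, "rb") as f:
            while True:
                block = f.read(BLOCK_SIZE)
                if not block:
                    break
                sums.append(hashlib.md5(block).hexdigest())
        with open(sums_path, "w") as f:
            json.dump({"block_size": BLOCK_SIZE, "md5": sums}, f)

    def read_verified(data_path, sums_path):
        # Yield blocks of the data file, failing loudly if any block has drifted.
        with open(sums_path) as f:
            meta = json.load(f)
        with open(data_path, "rb") as f:
            for i, expected in enumerate(meta["md5"]):
                block = f.read(meta["block_size"])
                if hashlib.md5(block).hexdigest() != expected:
                    raise IOError("checksum mismatch in block %d of %s" % (i, data_path))
                yield block

As long as the auxiliary file is copied alongside the data, the check
works the same whether the blocks came off local disk, over the
network, or off a backup tape.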

Bill Bogstad






BLU is a member of BostonUserGroups
We also thank MIT for the use of their facilities.




Boston Linux & Unix / webmaster@blu.org