
BLU Discuss list archive



admin's worst nightmare...



On Tue, Mar 9, 2010 at 10:07 AM, Stephen Adler <adler-wRvlPVLobi1/31tCrMuHxg at public.gmane.org> wrote:
>...
>
> The backups are made by copying the original data off the sun blade
> system to backup system #1. I then run a nightly cron job which rsyncs
> the data from backup system #1 to backup systems #2 and #3. The problem
> is that backup system #1 has the corrupted file system, so once the
> files were placed on backup system #1 and corrupted, I ended up copying
> corrupted data onto backup systems #2 and #3.
>
>...
> terabyte drives running a software RAID 5 array. Also, I have the
> smartd tools running, doing nightly and weekly checks. With all that in
> place, there was no warning of errors on the file system, which makes
> me think there is a bug in ext3/md RAID 5 or the PCI eSATA controller
> card is mucked up. I still have to verify the memory, which is supposed
> to be ECC memory.

Don't forget that ECC only reduces the probability that bad data will
be read from memory; it doesn't eliminate it.

Also, you are assuming that the problem here is in system #1.  It
could very well be that the error occurred during the network transfer
when the data was copied from the original system to system #1.  I
can't remember who it was now, but someone from AT&T gave a talk at
BBLISA a few years back about always checksumming files every time he
moved or copied them between networked systems.  As I recall, he found
plenty of errors on 'working' systems that were handling massive
amounts of data.

>
> I tried to do my homework in setting up this backup system, and with all
> the redundancy I put in, I thought I didn't need MD5 checksums.
> Well... I've learned my lesson the hard way.
>
> So... the lesson learned...
>
> ALWAYS DO MD5 CHECKSUMS ON CRITICAL DATA DURING BACKUPS, NO MATTER HOW
> LONG IT TAKES, BEFORE YOU DELETE THE ORIGINAL DATA.

And do it every time you transfer that data to a new location.  And if
you are really paranoid, modify your applications to checksum the data
as they read it off the disk for processing.  Of course this will
require either modifying your data file format or adding an auxiliary
file in which you store the block(?)-level checksums.  Kind of like
what ZFS does, but extending the protection to transfers across
networks or even to storage on backup media.
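Something along these lines, maybe (untested; the 1 MiB block size and
the .md5blocks file name are just made up for illustration, not an
existing tool or format):

import hashlib

BLOCK = 1 << 20  # 1 MiB blocks

def write_block_sums(data_path, sums_path):
    """Write 'index md5' lines, one per fixed-size block of the data file."""
    with open(data_path, "rb") as data, open(sums_path, "w") as sums:
        index = 0
        while True:
            block = data.read(BLOCK)
            if not block:
                break
            sums.write("%d %s\n" % (index, hashlib.md5(block).hexdigest()))
            index += 1

def verify_block_sums(data_path, sums_path):
    """Return the indices of blocks whose current digest no longer matches."""
    bad = []
    with open(data_path, "rb") as data, open(sums_path) as sums:
        for line in sums:
            index, expected = line.split()
            block = data.read(BLOCK)
            if hashlib.md5(block).hexdigest() != expected:
                bad.append(int(index))
    return bad

# Generate run042.dat.md5blocks when the file is first produced, ship it
# through every copy along with the data, and run verify_block_sums() on
# each backup host before deleting anything upstream.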

Bill Bogstad





