BLU Discuss list archive

admins worst nightmare...

Subject: admins worst nightmare...
From: adler-wRvlPVLobi1/31tCrMuHxg at public.gmane.org (Stephen Adler)
Date: Tue, 09 Mar 2010 10:07:26 -0500
In-reply-to: <000201cabf3b$0b52c570$21f85050$@com>
References: <4B956177.2010908@stephenadler.com> <000201cabf3b$0b52c570$21f85050$@com>

Hi Edward, these are very good bits of advice (and thanks to the 
others who replied...).

Here is where I stand now, for those of you who are curious...

In setting up my backup system, I had set up a redundant system 
to take care of this kind of failure. Thus, when I do the backup, I 
actually copy the data to a total of three separate systems. The problem 
with my design is that I skipped doing md5 checksums during the backups 
and only counted bytes copied. This was to save time. The data is 
generated on an old Sun Blade system and it's *really slow* when doing 
checksums. (I'm copying around 10 gigabytes of data at a time...) So... 
with this design flaw in place, here is how it played out.
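
A minimal Python sketch of why a byte count alone misses this kind of 
corruption: two files can have identical sizes and still differ in 
content, and only a content digest shows it. The helper names 
`file_md5` and `same_size_and_digest` and the 1 MiB chunk size are 
assumptions for illustration, not part of the setup described above.

```python
import hashlib
import os

def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read chunk by chunk so a multi-gigabyte file never sits in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def same_size_and_digest(path_a, path_b):
    """Return (sizes_match, digests_match) for two files (hypothetical helper)."""
    sizes_match = os.path.getsize(path_a) == os.path.getsize(path_b)
    digests_match = file_md5(path_a) == file_md5(path_b)
    return sizes_match, digests_match
```

A result of `(True, False)` is exactly the case a byte count cannot 
see: the copy is the right length, but its content has changed.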

The backups are made by copying the original data off the Sun Blade 
system to backup system #1. I then run a nightly cron job which rsyncs 
the data from backup system #1 to backup systems #2 and #3. The problem 
is that backup system #1 has the corrupted file system, so once the 
files were placed on backup system #1 and corrupted, I ended up copying 
the corrupted data onto backup systems #2 and #3.
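
One way to keep a corrupted first hop from feeding the later ones is 
to record digests at the source and check the first backup against 
them before the nightly replication runs. The following is a sketch 
under that assumption; `build_manifest` and `find_mismatches` are 
hypothetical names, and the directory arguments stand in for wherever 
the source data and the backup copy are actually mounted.

```python
import hashlib
import os

def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def build_manifest(root):
    """Map each file's path relative to root to its MD5 digest."""
    manifest = {}
    for directory, _subdirs, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.join(directory, name)
            manifest[os.path.relpath(full_path, root)] = file_md5(full_path)
    return manifest

def find_mismatches(manifest, root):
    """Return sorted relative paths under root that are missing or differ."""
    bad = []
    for relative_path, expected in manifest.items():
        candidate = os.path.join(root, relative_path)
        if not os.path.isfile(candidate) or file_md5(candidate) != expected:
            bad.append(relative_path)
    return sorted(bad)
```

The manifest would be built on the machine that generates the data, 
and replication would go ahead only when `find_mismatches` returns an 
empty list for the copy on the first backup system. Note that reading 
a file back right after writing it may be served from the page cache, 
so a check made later (or after a remount) is more likely to exercise 
the disks themselves.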

I discovered the data corruption when I went to restore the data to 
reprocess it, and the reprocessing was failing due to the corrupted data.

When I put the backup system in place, I performed a series of 
validation tests, and the file system on backup system #1 was working 
since I was able to perform restores and data reprocessing successfully. 
But that was then, and now is now...

What I've done now is bypass backup system #1 and effectively make backup 
system #2 my primary backup, so the data is now going from the Sun system 
to backup system #2 directly. I'm going to take backup system #1 offline 
and perform the tests you have suggested, including memtest86 etc.

More particulars... Backup system #1 is a Dell server which I bought 
about 5 months ago. Red Hat Enterprise Linux 5.4 did not run well on it, 
so I installed Fedora 12. Backup system #2 is running RHEL 5.4. 
Furthermore, backup system #1 (Fedora 12) has seven 1-terabyte Seagate 
enterprise disks (the ones with the long mean time between failures). But 
four of those disks are hanging off the Dell system via external SATA 
connections; I bought a PCI SATA controller with four external SATA 
connectors. Finally, the 6-terabyte file system is made up of the seven 
1-terabyte drives running as a software RAID 5 array. Also, I have the 
smartd tools running nightly and weekly checks. With all that in 
place, there were no warnings of errors on the file system, which makes 
me think there is a bug in ext3/md raid5 or the PCI eSATA controller 
card is mucked up. I still have to verify the memory, which is supposed to 
be ECC memory.

I tried to do my homework in setting up this backup system, and with all 
the redundancy I put in, I thought I didn't need the md5 checksum. 
Well... I've learned my lesson the hard way.

So... the lesson learned...

ALWAYS DO MD5 CHECKSUMS ON CRITICAL DATA DURING BACKUPS, NO MATTER HOW 
LONG IT TAKES, BEFORE YOU DELETE THE ORIGINAL DATA.

Cheers. Steve.

On 03/08/2010 10:45 PM, Edward Ned Harvey wrote:
> The only time I've ever had something like this happen, I had a failing hard
> disk.  It was either writing or reading bits incorrectly, but not detecting
> the failure.  I also discovered the problem only by noticing md5sums failing
> to match.  This is something that's really nice about ZFS, always
> checksumming and checking checksums during normal operation.  Unfortunately
> that won't help you in your present situation though ...
>
> Here is what I suggest.
>
> Before you begin, just do an "ls -l" on the two files.  Do the sizes match?
> You may have a "cp" that's terminating early for some reason ... out of
> memory?
>
> #1 Face the fact that the longer you leave the system operating this way,
> the more bad data you'll have.  Face up to the fact that you need downtime.
> #2 Power off.  The longer the bad disk or disks run, or memory, or cpu, or
> whatever is failing... the more harm they cause.
> #3 Get a spare disk and a spare system.  Take all the disks out of the first
> system, and one-by-one "dd" the entire drive onto the spare.  Then perform a
> surface scan of the first disk.  Repeat with each disk until or unless you
> find a bad disk.  This will take forever.
> #4 While that is running, get memtest86 (or whatever is the modern day
> equivalent) and run it on your original system.
> #5 Stop your fsck, and don't do it, until after you find the failing part or
> parts.
>
>
>
>    
>> -----Original Message-----
>> From: discuss-bounces-mNDKBlG2WHs at public.gmane.org [mailto:discuss-bounces-mNDKBlG2WHs at public.gmane.org] On
>> Behalf Of Stephen Adler
>> Sent: Monday, March 08, 2010 3:44 PM
>> To: Blu unix (blu)
>> Subject: admins worst nightmare...
>>
>> Guys,
>>
>> I discovered that my 6 terabyte file system does not seem to be working
>> properly... I did a copy of a gigabyte sized file to find that the
>> original and copied md5sums differ.... uggg.... I'm doing a
>> filesystem check right now, but I'm wondering if you guys have any
>> thoughts on what may be going on with the file system. It's an ext3
>> file system mapped over a software raid 5 raid array. When I created the
>> file system, I used all the default mkfs parameters for a journaled file
>> system. (i.e. mkfs -J /dev/md127; where md127 is my raid device.)
>>
>> When I checked a small file, several megabytes in size, the copy and
>> original had the same md5sum.
>>
>> Is there a tool which will give you a summary of how different binary
>> files are? Will diff work on a gigabyte sized file?
>>
>> Cheers. Steve.
>>
>
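
On the question in the original message about summarizing how 
different two large binary files are: the standard `cmp -l` utility 
lists every differing byte, which can be a lot of output for a 
gigabyte file. The Python sketch below gives a compact summary 
instead. The function name `summarize_differences` and the 4096-byte 
block size are assumptions chosen for illustration; counting differing 
blocks may hint at whether the damage is scattered or clustered.

```python
def summarize_differences(path_a, path_b, block_size=4096):
    """Compare two files block by block and summarize where they differ."""
    differing_bytes = 0
    differing_blocks = 0
    first_offset = None
    offset = 0
    with open(path_a, "rb") as file_a, open(path_b, "rb") as file_b:
        while True:
            block_a = file_a.read(block_size)
            block_b = file_b.read(block_size)
            if not block_a and not block_b:
                break
            if block_a != block_b:
                differing_blocks += 1
                # Count mismatched positions in the overlapping part.
                for index, (byte_a, byte_b) in enumerate(zip(block_a, block_b)):
                    if byte_a != byte_b:
                        differing_bytes += 1
                        if first_offset is None:
                            first_offset = offset + index
                # Bytes present in only one file count as differences too.
                overlap = min(len(block_a), len(block_b))
                extra = abs(len(block_a) - len(block_b))
                if extra:
                    differing_bytes += extra
                    if first_offset is None:
                        first_offset = offset + overlap
            offset += max(len(block_a), len(block_b))
    return {
        "differing_bytes": differing_bytes,
        "differing_blocks": differing_blocks,
        "first_offset": first_offset,
    }
```

Because it reads both files in fixed-size blocks, memory use stays 
constant regardless of file size.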
