admins worst nightmare...

Edward Ned Harvey blu-Z8efaSeK1ezqlBn2x/YWAg at public.gmane.org
Mon Mar 8 22:45:59 EST 2010


The only time I've ever had something like this happen, I had a failing hard
disk.  It was either writing or reading bits incorrectly without detecting
the failure, and I too discovered the problem only by noticing md5sums
failing to match.  This is something that's really nice about ZFS: it
checksums everything and verifies those checksums during normal operation.
Unfortunately that won't help you in your present situation, though ...

Here is what I suggest.  

Before you begin, just do an "ls -l" on the two files.  Do the sizes match?
You may have a "cp" that's terminating early for some reason ... out of
memory?
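A quick sketch of that check (the filenames original.bin and copy.bin are placeholders for your actual files; the first two lines just fabricate a sample pair so the commands have something to run against):

```shell
# Placeholder files standing in for your real original and copy -- drop
# these two lines and substitute your own paths for real use:
printf 'hello' > original.bin
cp original.bin copy.bin

ls -l original.bin copy.bin    # sizes should match exactly
md5sum original.bin copy.bin   # checksums should match too
# cmp also answers "how different are they": it reports the first
# differing byte, and works fine on multi-gigabyte files:
cmp original.bin copy.bin && echo "files identical"
```

If cmp is silent (and prints "files identical" here), the two files are byte-for-byte the same; otherwise it tells you the offset of the first mismatch, which is often more useful than diff on binary data.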

#1 Face the fact that the longer you leave the system operating this way,
the more bad data you'll accumulate.  Accept that you need downtime.
#2 Power off.  The longer the bad disk or disks run (or memory, or CPU, or
whatever is failing), the more harm they cause.
#3 Get a spare disk and a spare system.  Take all the disks out of the first
system and, one by one, "dd" each entire drive onto the spare.  Then perform
a surface scan of each original disk.  Repeat with each disk until you find
a bad one.  This will take forever.
#4 While that is running, get memtest86 (or whatever the modern-day
equivalent is) and run it on your original system.
#5 Stop your fsck, and don't run it again until after you've found the
failing part or parts.
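Step #3 can be sketched roughly as follows. Here, image files stand in for real disks so the commands are safe to try; on the real system you'd use device names like /dev/sdb (suspect) and /dev/sdc (spare) -- those names are assumptions, adjust for your hardware:

```shell
# Stand-in for the suspect disk (on real hardware this would be /dev/sdb):
dd if=/dev/zero of=disk.img bs=1M count=4

# Clone suspect onto spare; conv=noerror,sync keeps dd going past read
# errors, padding unreadable blocks with zeros instead of aborting:
dd if=disk.img of=spare.img bs=1M conv=noerror,sync

# Verify the clone is byte-for-byte identical:
cmp disk.img spare.img && echo "clone verified"

# After cloning real hardware, surface-scan the suspect disk read-only,
# for example:
#   badblocks -sv /dev/sdb
```

The point of cloning before scanning is that every extra read of a dying disk risks making things worse; you want a copy in hand before you stress the drive any further.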



> -----Original Message-----
> From: discuss-bounces-mNDKBlG2WHs at public.gmane.org [mailto:discuss-bounces-mNDKBlG2WHs at public.gmane.org] On
> Behalf Of Stephen Adler
> Sent: Monday, March 08, 2010 3:44 PM
> To: Blu unix (blu)
> Subject: admins worst nightmare...
> 
> Guys,
> 
> I discovered that my 6 terabyte file system does not seem to be working
> properly... I did a copy of a gigabyte-sized file, only to find that the
> original's and the copy's md5sums differ.... uggg.... I'm doing a
> filesystem check right now, but I'm wondering if you guys have any
> thoughts on what may be going on with the file system. It's an ext3
> file system mapped over a software RAID 5 array. When I created the
> file system, I used all the default mkfs parameters for a journaled
> file system. (i.e. mkfs -J /dev/md127, where md127 is my RAID device.)
> 
> When I checked a small file, several megabytes in size, the copy and
> original had the same md5sum.
> 
> Is there a tool which will give you a summary of how different two binary
> files are? Will diff work on a gigabyte-sized file?
> 
> Cheers. Steve.
> 
> _______________________________________________
> Discuss mailing list
> Discuss-mNDKBlG2WHs at public.gmane.org
> http://lists.blu.org/mailman/listinfo/discuss
