Boston Linux & UNIX was originally founded in 1994 as part of The Boston Computer Society. We meet on the third Wednesday of each month at the Massachusetts Institute of Technology, in Building E51.

BLU Discuss list archive



admin's worst nightmare...



The only time I've ever had something like this happen, I had a failing hard
disk.  It was either writing or reading bits incorrectly, but not detecting
the failure.  I also discovered the problem only by noticing that md5sums
failed to match.  This is something that's really nice about ZFS: it
checksums everything and verifies those checksums during normal operation.
Unfortunately that won't help you in your present situation though ...
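As an aside, on a ZFS pool you can kick off that end-to-end verification
whenever you like; "tank" below is just a placeholder pool name:

    zpool scrub tank        # re-read every block and verify its checksum
    zpool status -v tank    # show scrub progress and list any files with checksum errors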

Here is what I suggest.  

Before you begin, just do an "ls -l" on the two files.  Do the sizes match?
You may have a "cp" that's terminating early for some reason ... out of
memory?
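A quick way to see how far apart the original and the copy actually are
(the file names here are just stand-ins for yours):

    ls -l original.dat copy.dat            # do the byte counts even match?
    cmp original.dat copy.dat              # prints the offset of the first differing byte
    cmp -l original.dat copy.dat | wc -l   # rough count of how many bytes differ

cmp copes fine with gigabyte-sized files, so this is also a more useful
answer than diff to the "how different are the two binaries" question.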

#1 Accept that the longer you leave the system running in this state, the
more bad data you will accumulate.  You need downtime.
#2 Power off.  The longer the failing part keeps running, whether it's a
disk, memory, the CPU, or something else, the more harm it does.
#3 Get a spare disk and a spare system.  Take all the disks out of the
original system and, one by one, "dd" each entire drive onto the spare, then
run a surface scan on the original disk.  Repeat with each disk until you
find a bad one (see the example commands after this list).  This will take
forever.
#4 While that is running, get memtest86 (or whatever the modern-day
equivalent is) and run it on your original system.
#5 Stop your fsck, and don't run it again until after you've found the
failing part or parts.
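Here's roughly what #3 looks like in practice.  The device names are
placeholders, so triple-check which disk is the suspect and which is the
spare before running anything destructive:

    # image the suspect disk onto the spare, continuing past read errors
    dd if=/dev/sdX of=/dev/sdY bs=1M conv=noerror,sync

    # read-only surface scan of the suspect disk
    badblocks -sv /dev/sdX

It's also worth pulling each drive's SMART error log with
"smartctl -a /dev/sdX" (from smartmontools) while you're at it.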



> -----Original Message-----
> From: discuss-bounces-mNDKBlG2WHs at public.gmane.org [mailto:discuss-bounces-mNDKBlG2WHs at public.gmane.org] On
> Behalf Of Stephen Adler
> Sent: Monday, March 08, 2010 3:44 PM
> To: Blu unix (blu)
> Subject: admins worst nightmare...
> 
> Guys,
> 
> I discovered that my 6 terabyte file system does not seem to be working
> properly... I copied a gigabyte-sized file and found that the original
> and the copy have different md5sums.... uggg.... I'm doing a filesystem
> check right now, but I'm wondering if you guys have any thoughts on what
> may be going on with the file system. It's an ext3 file system mapped
> over a software RAID 5 array. When I created the file system, I used all
> the default mkfs parameters for a journaled file system.
> (i.e. mkfs -J /dev/md127; where md127 is my raid device.)
> 
> When I checked a small file, several megabytes in size, the copy and
> original had the same md5sum.
> 
> Is there a tool which will give you a summary of how different binary
> files are? will diff work on a gigabyte sized file?
> 
> Cheers. Steve.
> 
> _______________________________________________
> Discuss mailing list
> Discuss-mNDKBlG2WHs at public.gmane.org
> http://lists.blu.org/mailman/listinfo/discuss







BLU is a member of BostonUserGroups
We also thank MIT for the use of their facilities.




Boston Linux & Unix / webmaster@blu.org