On an Ubuntu Feisty system, I received notice of a degraded RAID array after rebooting today. Investigating showed:

# mdadm --detail /dev/md1
/dev/md1:
        Version : 00.90.03
  Creation Time : Fri Jan 26 16:20:26 2007
     Raid Level : raid5
...
   Raid Devices : 4
  Total Devices : 3
Preferred Minor : 1
    Persistence : Superblock is persistent
...
          State : clean, degraded
 Active Devices : 3
Working Devices : 3
 Failed Devices : 0
  Spare Devices : 0
...
    Number   Major   Minor   RaidDevice State
       0     254        4        0      active sync   /dev/mapper/sda1
       1       0        0        1      removed
       2     254        5        2      active sync   /dev/mapper/sdc1
       3     254        6        3      active sync   /dev/mapper/sdd1

If it were a hardware problem, or otherwise a problem with the physical drive, I'd expect it to show up as "failed" rather than "removed." There were no complaints when the device was re-added:

# mdadm -v /dev/md1 --add /dev/mapper/sdb1
mdadm: added /dev/mapper/sdb1

but it troubles me that it just disappeared on its own.

dmesg doesn't seem to show anything interesting, other than sdb1 not being picked up by md:

# dmesg | fgrep sd
...
[   35.520480] sdb: Write Protect is off
[   35.520483] sdb: Mode Sense: 00 3a 00 00
[   35.520496] SCSI device sdb: write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[   35.520542] SCSI device sdb: 625142448 512-byte hdwr sectors (320073 MB)
[   35.520550] sdb: Write Protect is off
[   35.520552] sdb: Mode Sense: 00 3a 00 00
[   35.520564] SCSI device sdb: write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[   35.520567] sdb: sdb1
[   35.538213] sd 1:0:0:0: Attached scsi disk sdb
...
[   35.939614] md: bind<sdc1>
[   35.939797] md: bind<sdd1>
[   35.939942] md: bind<sda1>
[   49.731674] md: unbind<sda1>
[   49.731684] md: export_rdev(sda1)
[   49.731707] md: unbind<sdd1>
[   49.731711] md: export_rdev(sdd1)
[   49.731722] md: unbind<sdc1>
[   49.731726] md: export_rdev(sdc1)

Other than the DegradedArray event, /var/log/daemon.log doesn't show anything interesting.

smartd didn't report any problems with /dev/sdb. Then again, while looking into this I found:

smartd[6370]: Device: /dev/hda, opened
smartd[6370]: Device: /dev/hda, found in smartd database.
smartd[6370]: Device: /dev/hda, is SMART capable. Adding to "monitor" list.
...
smartd[6370]: Device: /dev/sda, opened
smartd[6370]: Device: /dev/sda, IE (SMART) not enabled, skip device
Try 'smartctl -s on /dev/sda' to turn on SMART features
...
smartd[6370]: Device: /dev/sdb, IE (SMART) not enabled...
smartd[6370]: Device: /dev/sdc, IE (SMART) not enabled...
smartd[6370]: Device: /dev/sdd, IE (SMART) not enabled...
smartd[6370]: Monitoring 1 ATA and 0 SCSI devices

So it looks like the drives in the RAID array weren't being monitored by smartd. Running the suggested command:

# smartctl -s on /dev/sda
smartctl version 5.36 ...
unable to fetch IEC (SMART) mode page [unsupported field in scsi command]
A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.

It seems smartctl doesn't like these SATA drives. I'll have to investigate further...

I've noticed the device names have changed as of a reboot last weekend, probably due to upgrades to the udev system. The array was originally set up with /dev/sda1 ... /dev/sdd1, and the output from /proc/mdstat prior to a reboot last week showed:

md1 : active raid5 sda1[0] sdd1[3] sdc1[2] sdb1[1]

but now shows:

md1 : active raid5 dm-7[4] dm-6[3] dm-5[2] dm-4[0]

If that were the source of the problem, though, I'd expect it to throw off all the devices, not just one of the drives.
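Before trusting the rebuild, I plan to verify that the dm-* names are just new aliases for the same partitions, and that the re-added member carries the same array UUID as the others. Roughly something like this (a sketch from memory; the exact device-mapper names and output format on this box may differ):

# ls -l /dev/mapper/
    (the major:minor numbers should line up with the 254:4 through 254:6 shown by mdadm --detail above)
# dmsetup deps sdb1
    (should list the underlying block device, e.g. "(8, 17)" for /dev/sdb1)
# mdadm --examine /dev/mapper/sdb1 | egrep 'UUID|Events'
    (the array UUID should be identical across all four members, and the event counter should agree with the others once the resync completes)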
It may be relevant to note that the array was initially created in a degraded state (a 4-device array with only 3 devices active), with the 4th device being added just prior to the previous reboot. But the added device was /dev/sda1, not /dev/sdb1.

I've also noticed, during the last couple of reboots, a console message that says something like "no RAID arrays found in mdadm.conf," but as that file has been updated to reflect the current output of "mdadm --detail --scan" and the array has been functioning, I've ignored it. However, while investigating the above I noticed:

mythtv:/etc# dmesg | fgrep md:
[   31.069854] md: raid1 personality registered for level 1
[   31.651721] md: raid6 personality registered for level 6
[   31.651723] md: raid5 personality registered for level 5
[   31.651724] md: raid4 personality registered for level 4
[   35.710310] md: md0 stopped.
[   35.793291] md: md1 stopped.
[   35.939614] md: bind<sdc1>
[   35.939797] md: bind<sdd1>
[   35.939942] md: bind<sda1>
[   36.251952] md: array md1 already has disks!
[...80 more identical messages deleted...]
[   49.476995] md: array md1 already has disks!
[   49.731660] md: md1 stopped.
[   49.731674] md: unbind<sda1>
[   49.731684] md: export_rdev(sda1)
[   49.731707] md: unbind<sdd1>
[   49.731711] md: export_rdev(sdd1)
[   49.731722] md: unbind<sdc1>
[   49.731726] md: export_rdev(sdc1)
[   51.613310] md: bind<dm-4>
[   51.618923] md: bind<dm-5>
[   51.632529] md: bind<dm-6>
[   51.714527] md: couldn't update array info. -22
[   51.714580] md: couldn't update array info. -22

The "array md1 already has disks" messages, as well as the repeated starting/stopping and binding/unbinding, seem to suggest that something isn't quite right, although maybe some of this is by design. I see in /etc/default/mdadm:

# list of arrays (or 'all') to start automatically when the initial ramdisk
# loads. This list *must* include the array holding your root filesystem. Use
# 'none' to prevent any array from being started from the initial ramdisk.
INITRDSTART='all'

so maybe the array is being set up first by the initrd and then set up again at a later stage. This system doesn't have its root file system on the array, so I'm going to switch 'all' to 'none' (see the P.S. below).

That still leaves me without a likely cause for why the drive disappeared from the array.

 -Tom

-- 
Tom Metro
Venture Logic, Newton, MA, USA
"Enterprise solutions through open source."
Professional Profile: http://tmetro.venturelogic.com/
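P.S. The change I have in mind for /etc/default/mdadm, roughly (assuming the Feisty mdadm package copies this setting into the initramfs at build time, in which case the initramfs needs to be regenerated afterwards; I haven't verified that detail):

# sed -i "s/^INITRDSTART=.*/INITRDSTART='none'/" /etc/default/mdadm
    (leave the array assembly to the later init scripts instead of the initrd)
# update-initramfs -u
    (rebuild the current initramfs so the new setting takes effect)

and then a reboot to see whether the "array md1 already has disks" messages go away.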