Ron Peterson suggested:

> I've been using mdadm rather than raidtools lately. I like it.

Thanks! The BLU list is working just great--post suggestions, get more in return. I wouldn't have thought to look for mdadm, but it does what's been lacking with my 3Ware controller--alerting me to failures when they happen, rather than a few months later when I happen to check.

Here are my notes and comments for others contemplating hauling out that still-serviceable but you-thought-it-was-too-small hard drive to use as a RAID1 spare:

1. Set up mdadm monitoring first
2. Test your setup

A cookbook for doing these follows.

Oh, and to the fellow who suggested that hardware RAID might be faster at rebuilding, here is a benchmark from my system: 9 gigs re-synced in 10 minutes and 35 seconds. To my mind, rebuild performance is not especially important anyway: the re-sync happens in the background, and it takes far less time (during which you're at risk of a second drive failure) than going out to the store and buying a replacement for the drive that failed.

Suse 8.2 comes with mdadm version 1.1.0. You can get 1.3.0 from the mdadm distribution site; it fixes a minor parsing bug, so you no longer have to specify a device name (e.g. /dev/md0) on the command line in daemonize mode.

Setting up mdadm:

- Create the file /etc/mdadm.conf (note that the device list needs the DEVICE keyword):

    echo 'DEVICE /dev/hda2 /dev/hdc2' >/etc/mdadm.conf
    mdadm --detail --scan >>/etc/mdadm.conf
    echo 'MAILADDR root' >>/etc/mdadm.conf

- Verify your configuration by typing 'mdadm --detail /dev/md0':

    /dev/md0:
            Version : 00.90.00
      Creation Time : Wed Sep  3 18:18:12 2003
         Raid Level : raid1
         Array Size : 9422016 (8.99 GiB 9.65 GB)
        Device Size : 9422016 (8.99 GiB 9.65 GB)
       Raid Devices : 2
      Total Devices : 3
    Preferred Minor : 0
        Persistence : Superblock is persistent

        Update Time : Thu Sep  4 10:20:51 2003
              State : dirty, no-errors
     Active Devices : 2
    Working Devices : 2
     Failed Devices : 1
      Spare Devices : 0

        Number   Major   Minor   RaidDevice   State
           0      22        2        0        active sync   /dev/hdc2
           1       3        2        1        active sync   /dev/hda2
               UUID : c2cc6135:4e5560c7:6204121b:f92a7259
             Events : 0.16

  (The above is a little bit incorrect--I don't have a third, failed device. Maybe a bug, maybe an artifact of the fact that I created the array with mkraid instead of mdadm.)

- Create an rc script which invokes the mdadm daemon with the following command-line options:

    MDADM_BIN=/sbin/mdadm
    MDADM_ARGS='--monitor --delay=300 --scan --daemonize /dev/md0'

  Make sure it gets started at boot time, and double-check later that the mdadm process is running 24/7.

---

Testing your setup:

This took me about 40 minutes; it could have been quicker had I not run into what appears to be a minor kernel bug.

- Yank the power cable, or otherwise disable one member of the RAID array. You should see this in /var/log/messages:

    Sep  4 09:40:52 envoy kernel: hdc: drive not ready for command
    Sep  4 09:40:52 envoy kernel: hdc: status error: status=0x00 { }
    Sep  4 09:40:52 envoy kernel: raid1: Disk failure on hdc2, disabling device.
    Sep  4 09:40:52 envoy kernel: Operation continuing on 1 devices

  Your system should stay up and running, with nothing else happening beyond the syslog entry.

- Verify notification: an email should pop up:

    Subject: Fail event on /dev/md0:<hostname>
    Date: Thu, September 4, 2003 9:41 am

    This is an automatically generated mail message from mdadm
    running on <hostname>

    A Fail event had been detected on md device /dev/md0.

    It could be related to component device /dev/hdc2.

    Faithfully yours, etc.
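- By the way, you don't strictly have to yank power cables to run this drill: mdadm can fail a mirror half in software. I haven't exercised this path myself, so treat the following as a sketch from the man page rather than something I've verified (and substitute your own device names):

    # generate a test alert for each array--this exercises the MAILADDR
    # mail path without injecting any fault (check that your mdadm
    # version has the --test option before relying on it)
    mdadm --monitor --scan --oneshot --test

    # mark one mirror half faulty, drop it from the array, then add it
    # back to kick off a re-sync
    mdadm /dev/md0 --fail /dev/hdc2
    mdadm /dev/md0 --remove /dev/hdc2
    mdadm /dev/md0 --add /dev/hdc2

  The cable-pull test is still worth doing once, though, since it's the only way to prove the whole path--drive, controller, kernel, mdadm, mail--end to end.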
- Another useful command is 'mdadm --detail /dev/md0', which will now show the following for your devices:

        Number   Major   Minor   RaidDevice   State
           0      22        2        0        faulty        /dev/hdc2
           1       3        2        1        active sync   /dev/hda2

- OK, now we're satisfied that RAID is doing the job. (By the way, I tested this using the *other* drive, not the one that was sync'ed during the installation procedure described yesterday. That way I know I can boot up and run on either drive.)

- Now plug the power cable back in and try to re-sync the drive. I got the following:

    # mdadm /dev/md0 --add /dev/hdc2
    mdadm: hot add failed for /dev/hdc2: Device or resource busy

  The entry in /var/log/messages was:

    Sep  4 10:00:57 envoy kernel: md: trying to hot-add hdc2 to md0 ...

  I wound up having to reboot to clear the busy condition. Rebooting at this point is a smart idea for your test anyway, just to make sure you can boot on the single drive. (NOTE: I am only using RAID for my *root* filesystem. My *boot* filesystem is a regular ext2 partition. I'm not yet ready to figure out how to make grub do the right thing, having just learned it a couple of days ago.)

- After rebooting, you will need to manually add the failed RAID partition back in, unless you've set up a script to do this for you (a rough sketch of one appears at the end of this message):

    # mdadm /dev/md0 --add /dev/hdc2

  This time it worked, and the /var/log/messages entry starts with:

    Sep  4 10:10:16 envoy kernel: md: trying to hot-add hdc2 to md0 ...
    Sep  4 10:10:16 envoy kernel: md: bind<hdc2,2>
    Sep  4 10:10:16 envoy kernel: RAID1 conf printout:
    Sep  4 10:10:16 envoy kernel:  --- wd:1 rd:2 nd:2
    Sep  4 10:10:16 envoy kernel:  disk 0, s:0, o:0, n:0 rd:0 us:1 dev:[dev 00:00]
    ...
    Sep  4 10:10:16 envoy kernel: md: syncing RAID array md0
    Sep  4 10:10:16 envoy kernel: md: minimum _guaranteed_ reconstruction speed: 100 KB/sec/disc.
    Sep  4 10:10:16 envoy kernel: md: using maximum available idle IO bandwith (but not more than 100000 KB/sec) for reconstruction.
    Sep  4 10:10:16 envoy kernel: md: using 124k window, over a total of 9422016 blocks.

  and ends with:

    Sep  4 10:20:51 envoy kernel: md: md0: sync done.
    Sep  4 10:20:51 envoy kernel: RAID1 conf printout:
    ...
    Sep  4 10:20:51 envoy kernel: md: updating md0 RAID superblock on device
    Sep  4 10:20:51 envoy kernel: md: hdc2 [events: 00000010]<6>(write) hdc2's sb offset: 9422016
    Sep  4 10:20:51 envoy kernel: md: hda2 [events: 00000010]<6>(write) hda2's sb offset: 9437312
    Sep  4 10:20:51 envoy kernel: md: recovery thread finished ...

- While the re-sync is going on, a handy command to know is:

    # cat /proc/mdstat
    Personalities : [raid0] [raid1] [raid5]
    read_ahead 1024 sectors
    md0 : active raid1 hdc2[2] hda2[1]
          9422016 blocks [2/1] [_U]
          [=================>...]  recovery = 89.9% (8479864/9422016) finish=1.0min speed=14425K/sec
    unused devices: <none>

  That, along with the constant-on hard-drive LED on your front panel, shows you that it's working.

- Do a 'ps ax | grep mdadm' to make sure your monitoring daemon is running.

---

That's it! Not a whole lot to it, really.

Perhaps another contributor to the list can comment on why I got the drive-busy error when attempting to bring the failed device back online? Is it a BIOS issue (the motherboard's IDE controller doesn't support hot-plugging, perhaps?), or a kernel issue?

-rich
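P.S. For anyone who wants the boot-time re-add script I alluded to above, here's a rough, untested sketch. The degraded-array test and the device names are assumptions for a two-disk RAID1 like mine; adapt them to your own setup before trusting it:

    #!/bin/sh
    # Sketch: re-add a known mirror half if the array came up degraded
    # after a reboot.  MD and PART are placeholders for your own layout.
    MD=/dev/md0
    PART=/dev/hdc2

    # A degraded two-way RAID1 shows up in /proc/mdstat as [_U] or [U_].
    if grep -q -e '\[_U\]' -e '\[U_\]' /proc/mdstat; then
        echo "$MD is running degraded; attempting to re-add $PART"
        /sbin/mdadm "$MD" --add "$PART"
    fi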