RAID--quick, cheap, good/fast, pick all 3
Rich Braun
richb at pioneer.ci.net
Thu Sep 4 10:58:31 EDT 2003
Ron Peterson suggested:
> I've been using mdadm rather than raidtools lately. I like it.
Thanks! The BLU list is working just great--post suggestions, get more in
return. I wouldn't have thought to look for mdadm, but it does what's been
lacking with my 3Ware controller--alerting me to failures when they happen,
rather than a few months later when I happen to check.
Here are my notes and comments for others contemplating hauling out that
still-serviceable but you-thought-it-was-too-small hard drive to use as a
RAID1 spare:
1. Set up mdadm monitoring first
2. Test your setup
A cookbook for doing these follows. Oh, and to the fellow who suggested that
hardware RAID might be faster at rebuilding, here is a benchmark on my system:
9 gigs re-synced in 10 minutes and 35 seconds. To my mind, rebuild
performance is not especially important, because it happens in the background
and takes a lot less time (during which you're at risk of a 2nd drive failure)
than the process of going out to the store and buying a replacement for the
failed drive.
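(For the curious: the array is 9422016 one-kilobyte blocks, and 9422016 KB
over 635 seconds works out to roughly 14,800 KB/sec--which squares with the
speed=14425K/sec figure /proc/mdstat reports during the re-sync further down.)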
Suse 8.2 comes with mdadm version 1.1.0. You can get 1.3.0 from wherever the
mdadm distribution site is (fixes a minor parsing bug--you no longer have to
specify a device name e.g. /dev/md0 on the command line in daemonize mode).
Setting up mdadm:
- Create the file /etc/mdadm.conf:
echo 'DEVICE /dev/hda2 /dev/hdc2' >/etc/mdadm.conf
mdadm --detail --scan >>/etc/mdadm.conf
echo MAILADDR root >>/etc/mdadm.conf
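For reference, the finished file on this box ends up looking roughly like
this (the ARRAY line is what --detail --scan emits; your UUID will of course
differ):
DEVICE /dev/hda2 /dev/hdc2
ARRAY /dev/md0 level=raid1 num-devices=2 UUID=c2cc6135:4e5560c7:6204121b:f92a7259
MAILADDR root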
- Verify your configuration by typing 'mdadm --detail /dev/md0':
/dev/md0:
Version : 00.90.00
Creation Time : Wed Sep 3 18:18:12 2003
Raid Level : raid1
Array Size : 9422016 (8.99 GiB 9.65 GB)
Device Size : 9422016 (8.99 GiB 9.65 GB)
Raid Devices : 2
Total Devices : 3
Preferred Minor : 0
Persistence : Superblock is persistent
Update Time : Thu Sep 4 10:20:51 2003
State : dirty, no-errors
Active Devices : 2
Working Devices : 2
Failed Devices : 1
Spare Devices : 0
    Number   Major   Minor   RaidDevice   State
       0       22       2        0        active sync   /dev/hdc2
       1        3       2        1        active sync   /dev/hda2
UUID : c2cc6135:4e5560c7:6204121b:f92a7259
Events : 0.16
(The above is a little bit incorrect--I don't have a 3rd, failed device--maybe
a bug, maybe an artifact of the fact that I created the array with mkraid
instead of mdadm.)
- Create an rc script which invokes the mdadm daemon with the following
command line options:
MDADM_BIN=/sbin/mdadm
MDADM_ARGS='--monitor --delay=300 --scan --daemonize /dev/md0'
Make sure it gets started at boot time (double-check later that the mdadm
process is running 24/7).
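Here's a bare-bones sketch of what that rc script might look like (the
/etc/init.d path and the killall-based stop are my assumptions, not gospel--
adjust for your distribution):
#!/bin/sh
# /etc/init.d/mdadmd -- start/stop the mdadm monitoring daemon.
MDADM_BIN=/sbin/mdadm
MDADM_ARGS='--monitor --delay=300 --scan --daemonize /dev/md0'

case "$1" in
  start)
        echo "Starting mdadm monitor"
        $MDADM_BIN $MDADM_ARGS
        ;;
  stop)
        echo "Stopping mdadm monitor"
        killall mdadm
        ;;
  *)
        echo "Usage: $0 {start|stop}"
        exit 1
        ;;
esac
On Suse you'd then symlink it into the appropriate rc?.d runlevel
directories (or use insserv) so it comes up at boot.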
---
Testing your setup:
This took me about 40 minutes; it could have been quicker had I not run into
what appears to be a minor kernel bug.
- Yank the power cable or otherwise disable one member of the RAID array. See
in /var/log/messages:
Sep 4 09:40:52 envoy kernel: hdc: drive not ready for command
Sep 4 09:40:52 envoy kernel: hdc: status error: status=0x00 { }
Sep 4 09:40:52 envoy kernel: raid1: Disk failure on hdc2, disabling device.
Sep 4 09:40:52 envoy kernel: Operation continuing on 1 devices
Your system should stay up and running without anything else happening beyond
the syslog entry.
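(If you'd rather not yank power cables on live hardware, mdadm can also
simulate a failure in software--I haven't tried this path myself, but the
manage-mode syntax is:
# mdadm /dev/md0 --fail /dev/hdc2
which marks the member faulty and should trigger the same syslog entries and
mdadm alert as a real failure.)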
- Verify notification: an email should pop up:
Subject: Fail event on /dev/md0:<hostname>
Date: Thu, September 4, 2003 9:41 am
This is an automatically generated mail message from mdadm
running on <hostname>
A Fail event had been detected on md device /dev/md0.
It could be related to component device /dev/hdc2.
Faithfully yours, etc.
- Running 'mdadm --detail /dev/md0' again at this point gives the
following for your devices:
    Number   Major   Minor   RaidDevice   State
       0       22       2        0        faulty        /dev/hdc2
       1        3       2        1        active sync   /dev/hda2
- OK now we're satisfied that RAID is doing the job. (By the way, I tested
this using the *other* drive besides the one that was sync'ed during the
installation procedure described yesterday. That way I know that I can boot
up and run on either drive.)
- Now plug the power cable back in and try to re-sync the drive. My first
attempt failed with this error from mdadm:
# mdadm /dev/md0 --add /dev/hdc2
mdadm: hot add failed for /dev/hdc2: Device or resource busy
The entry in /var/log/messages was:
Sep 4 10:00:57 envoy kernel: md: trying to hot-add hdc2 to md0 ...
Wound up having to reboot to clear the busy condition. Rebooting at this
point is a smart idea for your test anyway, just to make sure you can boot on
the single drive. (NOTE: I am only using RAID for my *root* filesystem. My
*boot* filesystem is a regular ext2 partition. I'm not yet ready to try to
figure out how to make grub do the right thing, having just learned it a
couple days ago.)
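One thing that might have cleared the busy condition without a reboot (I
didn't think to try it at the time, so this is speculation) is explicitly
removing the faulty member from the array before re-adding it:
# mdadm /dev/md0 --remove /dev/hdc2
My understanding is that the kernel won't hot-add a partition the array
still holds as a failed member, and --remove is the way to detach it first.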
- After rebooting, you will need to manually add the failed RAID partition
back in (unless you've set up a script to do this for you--a rough sketch
follows the log excerpt below):
# mdadm /dev/md0 --add /dev/hdc2
This time it worked, and the /var/log/messages entry starts with:
Sep 4 10:10:16 envoy kernel: md: trying to hot-add hdc2 to md0 ...
Sep 4 10:10:16 envoy kernel: md: bind<hdc2,2>
Sep 4 10:10:16 envoy kernel: RAID1 conf printout:
Sep 4 10:10:16 envoy kernel: --- wd:1 rd:2 nd:2
Sep 4 10:10:16 envoy kernel: disk 0, s:0, o:0, n:0 rd:0 us:1 dev:[dev 00:00]
...
Sep 4 10:10:16 envoy kernel: md: syncing RAID array md0
Sep 4 10:10:16 envoy kernel: md: minimum _guaranteed_ reconstruction speed:
100 KB/sec/disc.
Sep 4 10:10:16 envoy kernel: md: using maximum available idle IO bandwith
(but not more than 100000 KB/sec) for reconstruction.
Sep 4 10:10:16 envoy kernel: md: using 124k window, over a total of 9422016
blocks.
and ends with:
Sep 4 10:20:51 envoy kernel: md: md0: sync done.
Sep 4 10:20:51 envoy kernel: RAID1 conf printout:
...
Sep 4 10:20:51 envoy kernel: md: updating md0 RAID superblock on device
Sep 4 10:20:51 envoy kernel: md: hdc2 [events: 00000010]<6>(write) hdc2's sb
offset: 9422016
Sep 4 10:20:51 envoy kernel: md: hda2 [events: 00000010]<6>(write) hda2's sb
offset: 9437312
Sep 4 10:20:51 envoy kernel: md: recovery thread finished ...
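As for the "script to do this for you" mentioned above, here's a
hypothetical sketch (the device names are from my setup, and it naively
greps /proc/mdstat, so treat it as a starting point, not a finished tool):
#!/bin/sh
# readd-mirror.sh -- after a clean reboot on a degraded RAID1, re-add
# whichever half of the md0 mirror is missing.
MD=/dev/md0
for part in /dev/hda2 /dev/hdc2; do
    name=`basename $part`
    # /proc/mdstat lists current members like "hda2[1]"; if a partition
    # isn't mentioned there, try to hot-add it.
    if ! grep -q "$name" /proc/mdstat; then
        echo "Re-adding $part to $MD"
        mdadm $MD --add $part
    fi
done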
- While the re-sync is going on, a handy command to know is:
# cat /proc/mdstat
Personalities : [raid0] [raid1] [raid5]
read_ahead 1024 sectors
md0 : active raid1 hdc2[2] hda2[1]
9422016 blocks [2/1] [_U]
[=================>...] recovery = 89.9% (8479864/9422016) finish=1.0min
speed=14425K/sec
unused devices: <none>
That, along with the constant-on hdd LED on your front panel, shows you that
it's working.
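If you'd rather not keep retyping that, something like 'watch -n5 cat
/proc/mdstat' (watch ships with procps) will repaint it every five seconds.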
- Do a 'ps ax | grep mdadm' to make sure your monitoring daemon's running.
---
That's it! Not a whole lot to it, really.
Perhaps another contributor to the list can comment on why I got the
drive-busy error when attempting to bring the failed device back online? Is
it a BIOS issue (motherboard IDE doesn't support hot-plugging, perhaps?), or a
kernel issue?
-rich