BLU Discuss list archive
Reminder -- RAID 5 is not your friend
- Subject: Reminder -- RAID 5 is not your friend
- From: kentborg-KwkGvOEf1og at public.gmane.org (Kent Borg)
- Date: Thu, 11 Mar 2010 11:53:52 -0500
- In-reply-to: <20100311042755.GO14999-mzpnVDyJpH4k7aNtvndDlA@public.gmane.org>
- References: <20100311042755.GO14999@tao.merseine.nu>
Dan Ritter wrote:
> Before we can get around to it, another disk in the storage
> system also dies. Poof.

The ironic thing that worries me about RAID is the danger in using the best available disks: if the whole array has matching disks, all correctly manufactured to spec, all the same age, all from the same factory batch, and all installed at the same time, it makes sense to me that more than one of them might wear out at about the same time. One might install a new disk, and as the array is being rebuilt, the rebuild stress on the other old disks is too much and another one dies.

Personally I only have experience with Linux software RAID 1, and in every case I have bought pairs of mismatched disks: the same capacity (roughly) but different manufacturers. I then format to a size slightly smaller than either disk, so that a replacement disk that might be yet a different model will most likely be big enough. (The first sketch below shows what I mean.)

I think smartmontools is your friend for getting warning of aging disks: it is wise to replace disks before they actually fail (second sketch below). Maybe once SMART starts to show much aging, replace only half of the pair, and then schedule the other half for replacement maybe nine months later. Why nine months? No good reason, only that six months seems too close to "now" to get much staggering of disk ages, and waiting a whole year might be long enough to have a failure in the other disk.

It is probably good to both use RAID and try to replace disks before they actually fail. That way one is using RAID as an extra safety measure and not as one's only precaution. I admit I have not been this disciplined.

Also, heat is the enemy of disk drives. One machine that I set up lives in a house in LA and has a cron task that checks disk temperatures once an hour; if a disk is too hot, it logs that complaint and hibernates the machine (third sketch below). Before I did that, that machine lost both disks in very quick succession. (Luckily one was still mostly readable.)

As for dedicated RAID hardware, I agree the good stuff is expensive, and I have never had a chance to play with it. But it worries me, too. Beige boxes are comforting in two ways:

1. I trust that Linux software RAID is quite well debugged; I do not trust that all hardware RAID is similarly bug-free.

2. Every component in a boring beige Linux box can be easily replaced, either overnight or at Microcenter, whichever is most convenient. Exactly matched replacement RAID hardware is possibly much more difficult to locate, and non-matching hardware might not understand the disk format used by the dead component.

Also, because beige boxes are cheap, it is practical to have 100% complete hardware redundancy. How many organizations have the budget for TWO complete sets of their "enterprise" hardware? And if they don't, should they really trust that *nothing* can stop that storage box? Using beige hardware it is affordable to have several TB of RAID 1 in a single box with a complete spare sitting next to it, resulting in an installation with no single points of failure. No single points other than: incoming network, power, external DNS dependencies, domain ownership problems, fire, lightning, flood, collapsed roofs, court orders, theft, sabotage, your own fat-fingered administration, software bugs, etc. I suggest physically separating paired hardware by a few feet so that small pyrotechnic and soda pop events can't easily take down everything. Ping-ponged off-site backups are good, too.

A downside to Linux software RAID is that it is complicated, and the tools are confusing to use. It is good to have clear procedures figured out in advance for all the likely scenarios. Fill in your particulars (partition numbers, etc.) and have this in a printed form, along with a rescue CD, next to each box. (The last sketch below is the sort of thing I mean.)
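First, the mismatched-pair mirror. A minimal sketch, assuming two nominally-500GB disks with one partition each; the device names and the 480 GiB cap are made-up examples, so substitute your own:

  # Create a two-disk md mirror, deliberately capping the component
  # size below either disk, so a replacement of yet another model
  # will almost certainly be big enough.  --size is in KiB.
  mdadm --create /dev/md0 --level=1 --raid-devices=2 \
        --size=$((480 * 1024 * 1024)) /dev/sda1 /dev/sdb1
  mkfs.ext4 /dev/md0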
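Second, the SMART monitoring. A rough sketch with smartmontools; the device name is an example:

  # One-shot checks:
  smartctl -H /dev/sda    # overall health verdict
  smartctl -A /dev/sda    # attributes; watch Reallocated_Sector_Ct
                          # and friends creep upward as a disk ages

  # Or run the smartd daemon and let it nag you; a line like this
  # in /etc/smartd.conf monitors everything and mails on trouble:
  #   /dev/sda -a -m root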
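Third, the temperature check. A sketch of the sort of hourly cron job I mean; the 50 C threshold and the device list are assumptions, and an older box might use pm-hibernate instead of systemctl:

  #!/bin/sh
  # Hibernate the machine if any disk is running hot.
  LIMIT=50
  for d in /dev/sda /dev/sdb; do
      t=$(smartctl -A "$d" | awk '/Temperature_Celsius/ {print $10; exit}')
      if [ -n "$t" ] && [ "$t" -gt "$LIMIT" ]; then
          logger "disk $d at ${t}C, over ${LIMIT}C: hibernating"
          systemctl hibernate
      fi
  done

Drop something like that in /etc/cron.hourly and the machine shuts itself down before the disks cook.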
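Finally, the printed procedure. The most likely scenario is replacing a failed half of a mirror, and the drill might look something like this (array and device names are examples; fill in your own):

  mdadm --detail /dev/md0            # confirm which member failed
  mdadm /dev/md0 --fail /dev/sdb1    # mark it failed, if md hasn't already
  mdadm /dev/md0 --remove /dev/sdb1  # pull it from the array
  # ...swap the physical drive, partition it to match...
  mdadm /dev/md0 --add /dev/sdb1     # start the rebuild
  cat /proc/mdstat                   # watch the resync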
Before going to enterprise hardware I would look at larger architecture issues and see what can be done with multiple instances of beige hardware (and near-beige hardware, e.g., Supermicro servers; hot-swap disks and hot-swap redundant power supplies might be worth it). Remember, from what we know of Google, they built a rather large operation on beige hardware, and are now using customized hardware that is still very close to beige. It is possible that they have no enterprise hardware, or maybe only for doing payroll, and only then because the suits purchased it.

-kb, the Kent who thinks reliability issues are tricky and need to be thought through *very* carefully.
- References:
- Reminder -- RAID 5 is not your friend
- From: dsr-mzpnVDyJpH4k7aNtvndDlA at public.gmane.org (Dan Ritter)