BLU Discuss list archive


VIrtualization HA challenge

Subject: VIrtualization HA challenge
From: kentborg-KwkGvOEf1og at public.gmane.org (Kent Borg)
Date: Mon, 17 Jan 2011 14:34:58 -0500
In-reply-to: <b734bc194ea843e9fa9a90f7bdd7ea8f.squirrel-2RFepEojUI3hzJAekONQAQ@public.gmane.org>
References: <b734bc194ea843e9fa9a90f7bdd7ea8f.squirrel@webmail.ci.net>

Rich Braun wrote:
> - Two multi-core systems with two NIC cards and 4-6 SATA drives, 16-32GB of RAM.
> - Xen or VirtualBox virtualization.
> - Two ethernet switches, at least one of which is jumbo-frame capable (for use
> as a SAN switch).
> - Open-source SAN with automatic failover of all the storage contained in
> these systems.
> - Virtual machines capable of running on either of the two hosts.

I spec'ed out something similar at a previous job, but never
got the go-ahead to build it. (Actually, I did finally get the go-ahead, 
but the economy was crumbling and the company couldn't get credit 
approval on the equipment lease.)

My idea was two identical multi-core machines, each with a lot of RAM 
and lots of disk capacity (at least 4 physical disks in each box, half 
one manufacturer, half another manufacturer), each using one ethernet 
port to talk to the world and each using a second gigabit ethernet port 
doing jumbo packets to talk to the other machine on a crossover cable 
(no switch for just two machines). Possibly use two gigabit links 
between the two machines if DRBD secondary writes prove to be a bottleneck.
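
A rough way to see what the crossover link can carry, as a
back-of-envelope sketch. The overhead figures are standard Ethernet
and IPv4/TCP header sizes; the assumption that replication traffic is
full-size TCP segments without options is mine, and the function names
are made up for illustration:

```python
# Per-frame Ethernet overhead that carries no payload (bytes):
# preamble + start delimiter 8, MAC header 14, FCS 4, inter-frame gap 12.
ETHERNET_OVERHEAD = 8 + 14 + 4 + 12
# IPv4 + TCP headers, assuming no options on either.
IP_TCP_HEADERS = 20 + 20


def payload_efficiency(mtu: int) -> float:
    """Fraction of wire time spent on TCP payload for full-size frames."""
    return (mtu - IP_TCP_HEADERS) / (mtu + ETHERNET_OVERHEAD)


def max_payload_mb_per_s(mtu: int, link_bits_per_s: float = 1e9,
                         links: int = 1) -> float:
    """Upper bound on replication throughput in MB/s (10**6 bytes).

    The `links` multiplier is optimistic: a second link only helps if
    the traffic can actually be spread across both.
    """
    return links * link_bits_per_s / 8 * payload_efficiency(mtu) / 1e6


if __name__ == "__main__":
    for mtu in (1500, 9000):
        print(mtu, round(payload_efficiency(mtu), 4),
              round(max_payload_mb_per_s(mtu), 1))
```

Jumbo frames only buy a few percent of raw bandwidth; the bigger win is
usually fewer packets per second for the CPUs to handle. Either way a
single gigabit link tops out below 125 MB/s, which a RAID 10 set can
exceed on sequential writes.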

My intended stack was:

- KVM
- DRBD, with the local disk as primary and the remote disk as secondary
- under DRBD: local disk, plus the cross-over cable to the other box
- LVM, to let me resize and snapshot things (wish btrfs were more mature 
and distributed)
- disks used half for local storage, half for remote storage, raid 10 on 
everything

I don't think there was any AoE in my design once I discovered DRBD.

A given virtual machine could run on either box. The local disk would be 
marked primary on DRBD, so reads would be fast. Writes would go to the 
local DRBD device and the remote DRBD device (secondary) via the 
cross-over cable to other box. For a conservative VM, DRBD would be
configured to block until both the local and remote writes completed,
but for some VMs it might be okay to let the remote disk lag. (Faster
writes, more data lost in a failure, but still able to migrate.)
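
The tradeoff can be shown with a toy model. This is not how DRBD is
implemented, just a sketch of the two acknowledgement policies (DRBD
calls the blocking one protocol C and the lagging one protocol A). The
class and method names are invented for the example:

```python
from collections import deque


class ReplicatedDisk:
    """Toy model of a mirrored block device; not DRBD's implementation."""

    def __init__(self, synchronous: bool):
        self.synchronous = synchronous
        self.local = []            # blocks on the primary's disk
        self.remote = []           # blocks on the secondary's disk
        self.in_flight = deque()   # acknowledged but not yet on the remote

    def write(self, block):
        """Return once the write is acknowledged to the VM."""
        self.local.append(block)
        if self.synchronous:
            # Blocking policy: the caller waits for the remote write too.
            self.remote.append(block)
        else:
            # Lagging policy: acknowledge now, replicate later.
            self.in_flight.append(block)

    def drain(self, n: int = 1):
        """Simulate the link delivering up to n queued writes."""
        for _ in range(min(n, len(self.in_flight))):
            self.remote.append(self.in_flight.popleft())

    def lost_if_primary_dies(self):
        """Acknowledged writes the surviving node would not have."""
        return list(self.in_flight)


if __name__ == "__main__":
    lagging = ReplicatedDisk(synchronous=False)
    for block in ("a", "b", "c"):
        lagging.write(block)
    lagging.drain(1)
    print(lagging.lost_if_primary_dies())
```

In the blocking case the lost list is always empty, at the price of
every write waiting on the crossover link.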

To migrate, apparently you set both sides to primary, tell KVM to
migrate, then switch the old machine back to secondary.
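
Spelled out as commands, that sequence might look like the sketch
below. I never built this, so treat it as untested. It assumes the
guests are managed through libvirt (virsh), that the DRBD resource is
configured to allow two primaries, and that the source host can ssh to
the destination. The resource, guest, and host names are placeholders:

```python
import subprocess


def migration_plan(resource: str, domain: str, dest_host: str):
    """Return the steps in order as (where, argv) pairs."""
    dest_uri = f"qemu+ssh://{dest_host}/system"
    return [
        # 1. Promote the destination so both sides are primary.
        ("destination", ["drbdadm", "primary", resource]),
        # 2. Live-migrate the guest to the other box.
        ("source", ["virsh", "migrate", "--live", domain, dest_uri]),
        # 3. Demote the old host back to secondary.
        ("source", ["drbdadm", "secondary", resource]),
    ]


def run_plan(plan, dest_host: str, dry_run: bool = True):
    """Print the plan, or run it from the source host if dry_run is False."""
    for where, argv in plan:
        cmd = argv if where == "source" else ["ssh", dest_host, *argv]
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True stops the sequence at the first failing step.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # "r0", "vm1" and "boxb" are hypothetical names.
    run_plan(migration_plan("r0", "vm1", "boxb"), "boxb", dry_run=True)
```

Stopping at the first failure matters: if the migrate step fails, you
do not want the old host demoted while the guest is still running there.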

The Supermicro boxes I had selected had dual hot-swappable power
supplies, ECC RAM, hot-pluggable disks. I would separate them physically
as far apart as rack space permitted.

My host OS install would have dual "/" partitions, with grub set up to
let me boot from either, plus one script to identify which "/" is
current and another to rsync the current "/" across to the other.
Before any risky OS manipulation, rsync the current "/" to the other;
after the risky work, leave everything alone in case a problem is
discovered a few hours or days later. The other "/" is usually
lagging, so it stays a known-good fallback.
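
A minimal sketch of those two scripts, under some assumptions: the
root device shows up by name in /proc/mounts (on some systems it
appears as /dev/root or similar, which this does not handle), the other
"/" is already mounted somewhere, and the device names and mount point
below are placeholders:

```python
def current_root_device(mounts_text: str) -> str:
    """Return the device mounted at "/" from /proc/mounts-style text."""
    device = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[1] == "/":
            device = fields[0]  # keep the last match; later mounts win
    if device is None:
        raise ValueError("no root mount found")
    return device


def other_root_device(current: str, pair: tuple) -> str:
    """Given the two "/" partitions, return the one not in use."""
    first, second = pair
    if current == first:
        return second
    if current == second:
        return first
    raise ValueError(f"{current} is not one of {pair}")


def rsync_command(target_mountpoint: str) -> list:
    """Build the rsync call that copies the running "/" to the other one."""
    return [
        "rsync", "-aHAX",       # archive mode plus hard links, ACLs, xattrs
        "--delete",             # make the copy exact
        "--one-file-system",    # skip /proc, /sys and other mounts
        "/", target_mountpoint.rstrip("/") + "/",
    ]


if __name__ == "__main__":
    with open("/proc/mounts") as f:
        current = current_root_device(f.read())
    print("current root:", current)
    print(" ".join(rsync_command("/mnt/otherroot")))
```

The copy's /etc/fstab and grub entries still have to name the right
partition; that part is left out here.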

If each machine is capable of running everything, then host OS updates
can happen without stopping anything; the cost is that migration is
unavailable during the maintenance window.

I think I would keep all the live VMs running on a single machine, 
leaving the other machine in warm spare mode, with plenty of CPU 
available, making it a place to stage changes in specific VMs.

The result was very few single points of failure:

- cross-over cable (breaks redundancy, but doesn't bring down services
except in a split-brain case, if the automatic HA setup or the admins
are dim)

- software bugs in host OS, configuration, custom scripts (hard to avoid 
in an HA installation that is tightly coupled, be careful)

- power can fail (physically route dual power carefully, possibly run 
own local UPSes on one leg if permitted)

- facility could burn down, flood, over-heat, collapse in earthquake, be 
robbed, vandalized, sabotaged (inside job or not)

- you could be manually shut down over an unpaid bill, administrative
mistake, DMCA stupidities, or court order

- fratricide: one of your boxes could fail in a pyrotechnic way (put at 
least a few feet distance between the two)

- unified administration makes a single fat finger dangerous (have 
defensive procedures, for example: use sudo, have other "/" not mounted 
or mounted readonly, have tested scripts for common operations that 
might be easy to do manually but also easy to get slightly wrong, have 
clear and precise upgrade/rollback plans and checklists before plunging 
in, use two sets of eyes to vet commands before "enter" is pressed, have 
common and emergency procedures documented and maintained, maintain 
documentation of your configuration's vital statistics, keep a 
maintenance log...etc, be as extreme as you need to be, depending on how 
much paranoia is warranted)

For not that much money one can build a pair of boxes that are more 
reliable than their admins. Cheap enough that a complete second 
non-production staging copy is a pretty cheap way to add maybe almost 
another nine. (Not for sure. A complete second copy is a safe place to 
do destructive things...except if an admin accidentally types into the 
wrong window...distinct prompts sound like a good start, maybe some anal 
rules and mechanisms prohibiting logins to both sites at the same time.)
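
For a sense of what "another nine" means, here is the usual
arithmetic. The only thing it assumes is the common convention that N
nines means an unavailability of 10^-N; the function name is made up:

```python
HOURS_PER_YEAR = 365 * 24


def downtime_hours_per_year(nines: int) -> float:
    """Expected downtime per year at the given number of nines."""
    unavailability = 10 ** (-nines)
    return HOURS_PER_YEAR * unavailability


if __name__ == "__main__":
    for n in (2, 3, 4, 5):
        print(n, "nines:", round(downtime_hours_per_year(n), 3), "hours/year")
```

Each extra nine cuts the downtime budget by a factor of ten, so one
bad afternoon of admin error can use up years of it.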

> It would be a bit of a challenge to build this using /four/ machines (a pair
> each for storage and for virtualization) but doing this on two would make it a
> killer-app platform.

Last I looked, DRBD was commercial for more than two nodes. But it looked
like a good product and likely worth it.

> I say it's "tantalizing", though, after getting various pieces work
> individually but not quite integrated:  AoE (ATA-over-Ethernet), OCFS2, DRBD,
> VirtualBox.

I had OCFS2 in my design but it was pointed out to me that I can just 
serve up a /dev/drbdX device directly to a VM. If only one instance of 
that VM is running (and primary) on that device at a time, then no 
OCFS2-like layer is needed. Yes, if you have an application layer that 
wants a shared file system with another instance, then OCFS2 is useful, 
but that sounds like a larger-than-one-machine cluster, whereas a set of 
VMs are probably smaller-than-one-machine. Even if different VMs have 
some files they want to share, that still sounds like a single volume 
for that purpose, but not everything.
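
If the guests are defined through libvirt (an assumption; nothing
above depends on it), handing a DRBD device straight to a VM is a
block-device disk entry in the domain definition. A small sketch that
generates one, with a hypothetical device path and target name:

```python
import xml.etree.ElementTree as ET


def drbd_disk_xml(drbd_device: str, target_dev: str = "vda") -> str:
    """Build a libvirt <disk> element that passes a block device to a guest."""
    disk = ET.Element("disk", {"type": "block", "device": "disk"})
    # Raw format, host page cache bypassed so both hosts see the same data.
    ET.SubElement(disk, "driver",
                  {"name": "qemu", "type": "raw", "cache": "none"})
    ET.SubElement(disk, "source", {"dev": drbd_device})
    ET.SubElement(disk, "target", {"dev": target_dev, "bus": "virtio"})
    return ET.tostring(disk, encoding="unicode")


if __name__ == "__main__":
    # "/dev/drbd0" stands in for whichever /dev/drbdX backs this VM.
    print(drbd_disk_xml("/dev/drbd0"))
```

One DRBD resource per VM keeps the "only one instance primary on that
device" rule easy to reason about.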

-kb, the Kent who would still like to build such a box.

References:
- VIrtualization HA challenge
  - From: richb-RBmg6HWzfGThzJAekONQAQ at public.gmane.org (Rich Braun)


Boston Linux & Unix / webmaster@blu.org