Redundant Web servers

Mon Dec 14 18:39:46 EST 2009

A plan to improve web server availability:

Assume one has a backup webserver with similar capacity to the production 
server. Redundancy can be achieved by adding a second A record for the www 
name, pointing to the backup webserver. The DNS server will then return 
both records for each query, in random order. If both webservers are up, 
there is obviously no problem. If one is down, we anticipate that browsers 
will go on to the second address if the first one does not respond. The 
question is, how universal is this ability? I have been doing some 
experiments and I believe it is near universal among recent browsers.
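
The fallback we are counting on can be sketched in a few lines of Python.
This only illustrates the idea and is not how any particular browser is
implemented. The name try_addresses, the injected connect callable and the
example addresses (taken from the 192.0.2.0/24 documentation range) are
assumptions made for the sketch.

```python
def try_addresses(addresses, connect, timeout=30.0):
    """Return (address, connection) for the first address that answers.

    addresses: IP strings in the order the DNS server returned them.
    connect:   hypothetical callable(address, timeout) that returns a
               connection or raises OSError on failure or timeout.
    """
    last_error = None
    for address in addresses:
        try:
            return address, connect(address, timeout)
        except OSError as exc:
            # This server did not respond; move on to the next A record.
            last_error = exc
    if last_error is None:
        raise OSError("no addresses to try")
    raise last_error


# Example with a fake connector in which the first server is "down".
def fake_connect(address, timeout):
    if address == "192.0.2.10":
        raise OSError("timed out")
    return "connected to " + address


result = try_addresses(["192.0.2.10", "192.0.2.20"], fake_connect)
```

With a real connector, such as a small wrapper around
socket.create_connection((address, 80), timeout=timeout), the visitor would
wait up to the full timeout on the dead address before the second one is
tried, which is the kind of delay measured in the experiments.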

Using the latest versions of six browsers on various machines about the 
office

   MSIE       8.0.6001.18702
   Opera      10.10
   Safari     4.0.4 531.21.10
   Firefox    3.5.30729
   Chrome     4.0.239.30
   Konqueror  4.3.3

the worst result when one server was down was a delay of about 30 seconds 
before the page was loaded. I conclude that some browsers have a 30-second 
timeout before trying the next IP address. FF 2.0 and Lynx never switched. 
I was able to test Safari 4.0, FF 2.0 and Chrome 3.0 on Adobe Browserlab, 
and those all failed to switch within the Browserlab timeout. On my PC, 
Safari 4.0.4 did manage to switch after about 30 seconds, so perhaps 
Browserlab is too quick to give up.

During periods when one server was down, users of non-switching browsers 
would have a 50% chance of getting the bad server in an individual browser 
session. On the other hand, with two servers the chance that one of them 
is down is about double the chance of a single server being down. The two 
effects roughly cancel, so this is close to a wash for the older browsers 
and a pretty big win for the newer ones. It is true that a user with an 
older browser could close the browser and wait 5 minutes (our DNS TTL) for 
another chance, but most users probably wouldn't do that.
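
The arithmetic behind this estimate can be checked with a small
calculation. It is a sketch under assumptions the text does not state:
each server is down independently with the same probability p, a
non-switching browser picks one of the two addresses at random and sticks
with it, and a switching browser fails only when both servers are down.
The value p = 0.01 is made up for illustration.

```python
def failure_chances(p):
    """Chance that a visitor cannot load the page, given down-probability p."""
    single_server = p
    # Exactly one of two down: 2*p*(1-p), and the browser picks the bad
    # one half the time. Both down: p*p, and every pick is bad.
    non_switching = 0.5 * (2 * p * (1 - p)) + p * p
    # A switching browser fails only if both servers are down.
    switching = p * p
    return single_server, non_switching, switching


single, non_switching, switching = failure_chances(0.01)
print(single, non_switching, switching)
```

Under these assumptions the non-switching case works out to p again, the
same as with a single server, while the switching case drops to p squared.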

This isn't perfect, but I think it achieves some valuable redundancy at 
low cost, and it does not introduce any single point of failure that 
didn't exist before. It does not require any special topology, hardware or 
skills either.

On our web site all internal links are relative. This means that once a 
browser session finds a working server, it will stay with that server, so 
there is only one delay per visitor. If the links were absolute, then 
there would be a delay on 50% of page views, which is not very attractive.