
BLU Discuss list archive



ECC RAM failure data - jre



On Thursday 26 February 2009 06:24:07 john_re wrote:
> Do you use ECC RAM? Do you have any data about failure rates?
> 
> I'm evaluating this for a system with 8GB DRAM, &
> http://en.wikipedia.org/wiki/Dynamic_random_access_memory#Errors_and_error_correction
> says
> "Tests [ecc] give widely varying error rates, but about 10^-12 upset/bit-hr
> is typical, roughly one bit error, per month, per gigabyte of memory.
> 
> In most computers used for serious scientific or financial computing and
> as servers, ECC is the rule rather than the exception, as can be seen by
> examining manufacturers' specifications."
> 
> 
> So, by that figure, 8GB of DRAM would see about 8 errors per month, i.e. about
> one every 3-4 days.
> 
> What rates do you have?

Back in the day, I worked for (the now defunct) Linux Networx, on site at
a major customer who had around 2000 cluster nodes, all running with
between 8 and 16G of RAM, typically either 8x 1G sticks or 8x2G sticks.

For simplicity's sake, say there were 16,000 1G sticks. On a typical day,
at least 5 additional sticks logged more than 10 correctable errors (CEs),
sometimes hundreds. Nodes that were busy
with a compute job weren't taken out of service until their jobs were
done, and typically continued to see about the same error rates, but
occasionally did get exponentially worse. Really the only time a machine
was taken out of service while still running a job was when it crashed.
Note that they *were* configured to panic upon detecting an uncorrectable
error (UE), as the uncorrected bit flip means data corruption, which is a
Bad Thing(tm) for, well, anyone, typically. UEs typically didn't happen
that often though, maybe one every few months out of all those sticks.

So yeah, I'd have to guesstimate an average of between 500 and 1000 CEs
per day across ~16,000 1G sticks, which does work out to something in the
same neighborhood as one bit error per month per GB of memory.
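
For anyone who wants to check the arithmetic, here is a rough sketch of the
estimate above. The 30-day month and the function names are assumptions made
for illustration; the 500-1000 CEs/day and ~16,000 GB figures are the
guesstimates from this post, not measured data.

```python
# Back-of-the-envelope check of the guesstimate above.
# Assumption: a 30-day month; all inputs are rough figures, not measurements.
DAYS_PER_MONTH = 30


def ce_per_gb_month(ce_per_day, total_gb, days_per_month=DAYS_PER_MONTH):
    """Correctable errors per GB per month for a fleet-wide daily CE count."""
    return ce_per_day * days_per_month / total_gb


def days_between_errors(rate_per_gb_month, gb, days_per_month=DAYS_PER_MONTH):
    """Average days between errors for a machine with `gb` GB of RAM."""
    return days_per_month / (rate_per_gb_month * gb)


# 500 to 1000 CEs/day spread across ~16,000 1G sticks.
low = ce_per_gb_month(500, 16_000)
high = ce_per_gb_month(1000, 16_000)
print(f"{low:.2f} to {high:.2f} CEs per GB per month")

# At one error per GB per month, an 8GB box would see one every few days.
print(f"{days_between_errors(1.0, 8):.2f} days between errors on 8GB")
```

Both ends of the range land within a factor of two of one error per GB per
month, and at that rate the 8GB case comes out inside the "one per 3-4 days"
window from the original question.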

HTH,

-- 
Jarod Wilson
jarod-ajLrJawYSntWk0Htik3J/w at public.gmane.org
