On Thursday 26 February 2009 06:24:07 john_re wrote:
> Do you use ECC RAM? Do you have any data about failure rates?
>
> I'm evaluating this for a system with 8GB DRAM, &
> http://en.wikipedia.org/wiki/Dynamic_random_access_memory#Errors_and_error_correction
> says
> "Tests [ecc] give widely varying error rates, but about 10^-12 upset/bit-hr
> is typical, roughly one bit error, per month, per gigabyte of memory.
>
> In most computers used for serious scientific or financial computing and
> as servers, ECC is the rule rather than the exception, as can be seen by
> examining manufacturers' specifications."
>
>
> So, for that data 8GB DRAM is about 8 errors per month, ie about
> one per 3-4 days.
>
> What rates do you have?

Back in the day, I worked for (the now defunct) Linux Networx, on site at a
major customer who had around 2000 cluster nodes, all running with between 8
and 16G of RAM, typically either 8x 1G sticks or 8x 2G sticks. For simplicity's
sake, say there were 16,000 1G sticks.

On a typical day, at least 5 additional sticks showed in excess of 10
correctable errors (CEs), sometimes in the 100s. Nodes that were busy with a
compute job weren't taken out of service until their jobs were done, and
typically continued to see about the same error rates, but occasionally did
get exponentially worse. Really the only time a machine was taken out of
service while still running a job was when it crashed.

Note that they *were* configured to panic upon detecting an uncorrectable
error (UE), as the uncorrected bit flip means data corruption, which is a Bad
Thing(tm) for, well, anyone, typically. UEs typically didn't happen that often
though, maybe one every few months out of all those sticks.

So yeah, I'd have to guesstimate an average of between 500 and 1000 CEs per
day across ~16,000 1G sticks, which does work out to something in the same
neighborhood as one bit error per month per GB of memory.
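For anyone who wants to redo that arithmetic with their own numbers, here is a rough sketch in Python. The function names and the 30-day month are assumptions made for illustration only; the inputs are the guesstimates above (500 to 1000 CEs per day, ~16,000 GB) and the quoted figure of roughly one error per GB per month.

```python
# Back-of-envelope check of the error-rate estimates in this thread.
# Assumption (not from the thread): a month is taken as 30 days.
DAYS_PER_MONTH = 30


def errors_per_gb_month(ces_per_day, total_gb, days_per_month=DAYS_PER_MONTH):
    """Convert a fleet-wide CE count per day into errors per GB per month."""
    return ces_per_day * days_per_month / total_gb


def days_between_errors(gb, rate_per_gb_month=1.0, days_per_month=DAYS_PER_MONTH):
    """Expected days between errors for a machine with `gb` of RAM."""
    return days_per_month / (gb * rate_per_gb_month)


if __name__ == "__main__":
    # Guesstimated fleet: 500-1000 CEs/day across ~16,000 1G sticks.
    low = errors_per_gb_month(500, 16_000)
    high = errors_per_gb_month(1000, 16_000)
    print(f"fleet estimate: {low:.2f} to {high:.2f} errors per GB per month")

    # The 8GB machine from the question, at ~1 error per GB per month.
    print(f"8GB box: one error every {days_between_errors(8):.2f} days")
```

With those inputs the fleet estimate lands between roughly 0.9 and 1.9 errors per GB per month, and an 8GB machine at one error per GB per month would see an error about every 3.75 days, in line with the "one per 3-4 days" in the question.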

HTH,

-- 
Jarod Wilson
jarod-ajLrJawYSntWk0Htik3J/w at public.gmane.org