mcelog reports AMD DRAM Parity Error?

Thu Nov 18 10:44:07 EST 2010

On Nov 18, 2010, at 10:30 AM, Derek Atkins wrote:

> Hey,
> 
> Back onto my mcelog issue from a while ago..

Crap, I apologize, I'd meant to follow up on this, and it fell
through the cracks... So I'm jumping on it right now.

> I finally updated to the
> newly released mcelog.x86_64 2:1.0-0.1.pre3.fc13 and when I ran mcelog
> I got this output:
> 
> HARDWARE ERROR. This is *NOT* a software problem!
> Please contact your hardware vendor
> MCE 0
> CPU 0 4 northbridge TSC 24b8cb30a62636 
> MISC c008000001000000 ADDR 3c5e80c80 
>  Northbridge DRAM Parity Error
>       bit34 = err cpu2
>       bit43 = L3 subcache in error bit 1
>       bit46 = corrected ecc error
>       bit59 = misc error valid
>  memory/cache error 'generic read mem transaction, generic transaction, level generic'
> STATUS 9c294834001d011b MCGSTATUS 0
> SOCKETID 0 
> 
> Does this mean I have a busted CPU?  Or busted RAM?

RAM. However, it's not a fatal error; it's simply a corrected
ECC error. I'm told this is all a single event here, and the
event was the corrected ECC error, anyway. So you might want
to replace some memory at some point, but hey, it's ECC memory
doing what it's designed to do here.
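
To see why this reads as a corrected rather than a fatal error, here is a
minimal sketch that picks apart the STATUS value from the quoted output.
It assumes the standard x86 machine-check status layout for bits 57-63;
the bit 46 label is copied from the mcelog output itself. The names
FLAGS and decode_status are made up for this illustration and are not
part of mcelog.

```python
# Sketch only: decode a few flag bits of a machine-check STATUS value.
# Bits 57-63 follow the standard x86 MCi_STATUS layout (assumed here);
# the bit 46 label is taken from the mcelog output quoted above.
FLAGS = {
    63: "VAL (error valid)",
    62: "OVER (overflow)",
    61: "UC (uncorrected)",
    60: "EN (reporting enabled)",
    59: "MISCV (misc error valid)",
    58: "ADDRV (address valid)",
    57: "PCC (processor context corrupt)",
    46: "corrected ecc error",
}


def decode_status(status):
    """Return the names of the known flag bits set in status, high bit first."""
    return [name for bit, name in sorted(FLAGS.items(), reverse=True)
            if (status >> bit) & 1]


STATUS = 0x9c294834001d011b  # value from the report above
flags = decode_status(STATUS)
is_corrected = "UC (uncorrected)" not in flags

for name in flags:
    print(name)
print("corrected" if is_corrected else "uncorrected")
```

The UC and PCC bits are clear and bit 46 is set, which matches what
mcelog printed: the hardware caught the error and fixed it.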

Back in the day, working on clusters of Opteron nodes that all
had 8 to 16 sticks of memory in them, it wasn't uncommon for
at least one stick in any given machine to experience a few
corrected ECC errors per day, and we only replaced a stick if
it started having more than 100/day. Granted, these machines
were getting *hammered* with large CFD jobs 24/7, so memory
was getting stressed pretty hard.

I'd probably not worry about the memory too much, unless it's
happening at least daily, and/or if it's causing some sort of
noticeable performance hit.
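
If you want to keep an eye on it, the 100/day rule of thumb above could
be sketched roughly like this. It assumes you already have a list of
(timestamp, DIMM label) pairs for corrected errors from somewhere; that
input format, the DIMM names, and the function name are hypothetical,
and mapping an error address to a physical stick is not covered here.

```python
from collections import Counter
from datetime import datetime, timedelta

REPLACE_THRESHOLD = 100  # corrected errors per day, per the rule of thumb above


def sticks_to_replace(events, threshold=REPLACE_THRESHOLD):
    """Return DIMM labels that logged more than `threshold` errors in any one day.

    events: iterable of (datetime, dimm_label) pairs, one per corrected error.
    """
    per_day = Counter((ts.date(), dimm) for ts, dimm in events)
    return sorted({dimm for (_, dimm), count in per_day.items()
                   if count > threshold})


# Made-up sample data: one noisy stick and one with only a few errors.
start = datetime(2010, 11, 18, 0, 0, 0)
sample = [(start + timedelta(minutes=i), "DIMM_A") for i in range(101)]
sample += [(start + timedelta(hours=i), "DIMM_B") for i in range(3)]

print(sticks_to_replace(sample))
```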

-- 
Jarod Wilson
jarod-ajLrJawYSntWk0Htik3J/w at public.gmane.org