
BLU Discuss list archive



decoding MCE Logs? Possible hardware issue?



On Tue, Sep 28, 2010 at 10:02 PM, Derek Atkins <derek-CrUh67yIh4IAvxtiuMwx3w at public.gmane.org> wrote:
> Ok,
>
> On Tue, September 28, 2010 9:47 pm, Jarod Wilson wrote:
>> On Tue, Sep 28, 2010 at 9:48 AM, Derek Atkins <warlord-3s7WtUTddSA at public.gmane.org> wrote:
>>> I noticed the following in my mcelog, and I was hoping someone could
>>> help me decode this. My google-fu has not led me to an answer.
>>>
>>> I'm running a Supermicro H8DA3-2 with two Quad-Core AMD Opteron(tm)
>>> Processor 2378 and 16GB of RAM (8 sticks of ACTICA DDR2 667 2GB ECC REG)
>>> purchased with the machine in Jan, 2009.
>>>
>>> Is this a memory issue?
>>
>> At first glance, it looks to be a bad CPU L3 cache, but hard to say for
>> sure...
>>
>> $ <paste your log into file 'log'>
>> $ mcelog --k8 --ascii < log
>>
>> mcelog: Cannot open /dev/mem for DMI decoding: Permission denied
>> MCE 0
>> HARDWARE ERROR. This is *NOT* a software problem!
>> Please contact your hardware vendor
>> MISC c008000001000000 ADDR 1c88309c0
>> STATUS 9c6cc450001d017b MCGSTATUS 0
>> HARDWARE ERROR. This is *NOT* a software problem!
>> Please contact your hardware vendor
>> CPU 0 0 data cache MISC c008000001000000 ADDR 1c88309c0
>>   Data cache ECC error (syndrome d9)
>>        bit42 = L3 subcache in error bit 0
>>        bit46 = corrected ecc error
>>        bit59 = misc error valid
>>   memory/cache error 'evict mem transaction, generic transaction, level generic'
>> STATUS 9c6cc450001d017b MCGSTATUS 0
>> (Fields were incomplete)
>>
>> I'd run mcelog with root privs on that machine itself and without the
>> --k8 flag (I ran on an Intel box) to make sure it's got the right cpu
>> type and access to /dev/mem for more accurate results...
>
> tail -6 /var/log/mcelog | mcelog --k8 --ascii
> MCE 0
> HARDWARE ERROR. This is *NOT* a software problem!
> Please contact your hardware vendor
> MISC c008000001000000 ADDR 234909fc0
> STATUS 9c524484001d011b MCGSTATUS 0
> HARDWARE ERROR. This is *NOT* a software problem!
> Please contact your hardware vendor
> CPU 0 0 data cache MISC c008000001000000 ADDR 234909fc0
>  Data cache ECC error (syndrome a4)
>       bit34 = err cpu2
>       bit42 = L3 subcache in error bit 0
>       bit46 = corrected ecc error
>       bit59 = misc error valid
>  memory/cache error 'generic read mem transaction, generic transaction, level generic'
> STATUS 9c524484001d011b MCGSTATUS 0
> (Fields were incomplete)
>
> So what does this mean?
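
For anyone following along, the STATUS values quoted above can be picked apart by hand. This is a minimal Python sketch, not mcelog's actual logic. Bits 63 to 57 are the architectural x86 machine-check flags; the syndrome position (bits 54:47) and the meaning of bit 46 are assumptions inferred from the mcelog output in this thread, chosen because they reproduce the reported syndromes d9 and a4. The names `ARCH_FLAGS` and `decode_status` are made up for illustration.

```python
# Architectural MCi_STATUS flag bits (common to x86 machine-check banks).
ARCH_FLAGS = {
    63: "VAL (error valid)",
    62: "OVER (overflow)",
    61: "UC (uncorrected)",
    60: "EN (reporting enabled)",
    59: "MISCV (MISC register valid)",
    58: "ADDRV (ADDR register valid)",
    57: "PCC (processor context corrupt)",
}


def decode_status(status):
    """Split a 64-bit STATUS value into a few readable fields."""
    flags = [
        name
        for bit, name in sorted(ARCH_FLAGS.items(), reverse=True)
        if (status >> bit) & 1
    ]
    return {
        "flags": flags,
        # Assumption: bit 46 is the "corrected ecc error" bit mcelog printed.
        "corrected_ecc": bool((status >> 46) & 1),
        # Assumption: the ECC syndrome sits in bits 54:47.
        "syndrome": (status >> 47) & 0xFF,
        # Low 16 bits carry the MCA error code.
        "error_code": status & 0xFFFF,
    }


if __name__ == "__main__":
    for raw in (0x9C6CC450001D017B, 0x9C524484001D011B):
        fields = decode_status(raw)
        print(f"{raw:016x}: syndrome={fields['syndrome']:02x} "
              f"corrected_ecc={fields['corrected_ecc']} flags={fields['flags']}")
```

Both values from the thread come back with UC clear and bit 46 set, which fits the "corrected ecc error" line in the mcelog output.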

Well, mcelog seems to think you have a bad CPU, but I'd have to talk
to some of the hardware folks at work to get a better idea of exactly
what's up. It seems possible it's just an ECC memory error too, though,
and one that was corrected. Do you have any EDAC modules loaded? I'm not
sure whether that box needs edac_amd64 or something else, or exactly
when edac_amd64 finally got merged upstream (and therefore into
the Fedora kernels). Yeah, lemme (try to remember to) poke some folks
who actually work on this code and know the hardware better
tomorrow...
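
One way to answer the "any edac modules loaded?" question on a given box is to scan /proc/modules, which is the list lsmod prints. The sketch below is only illustrative: `edac_modules` is a hypothetical helper, and it matches any module whose name contains "edac" because the exact driver name for this board is not settled in the thread.

```python
def edac_modules(proc_modules_text):
    """Return names of loaded modules whose name contains 'edac'."""
    names = []
    for line in proc_modules_text.splitlines():
        fields = line.split()
        # The first whitespace-separated field of each line is the module name.
        if fields and "edac" in fields[0]:
            names.append(fields[0])
    return names


if __name__ == "__main__":
    try:
        with open("/proc/modules") as handle:
            found = edac_modules(handle.read())
    except OSError as exc:
        print(f"could not read /proc/modules: {exc}")
    else:
        print(found if found else "no EDAC modules loaded")
```

An empty result would suggest no EDAC driver is loaded, in which case corrected memory errors would only show up through mcelog.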


-- 
Jarod Wilson
jarod-ajLrJawYSntWk0Htik3J/w at public.gmane.org







