BLU Discuss list archive

decoding MCE Logs? Possible hardware issue?

Subject: decoding MCE Logs? Possible hardware issue?
From: derek-CrUh67yIh4IAvxtiuMwx3w at public.gmane.org (Derek Atkins)
Date: Tue, 28 Sep 2010 22:17:05 -0400
In-reply-to: <AANLkTimhAVgk0Ubc=Yw2mjT0TzurF_BjQ7f5RNKVRR_i-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
References: <sjmr5gex9kk.fsf@pgpdev.ihtfp.org> <AANLkTikWfuisrVX2yH+M1gQc=zT6NWC712tNuoeeiW1O@mail.gmail.com> <4f8d21e0d7cb2ecf7fcb96a315d4a6de.squirrel@mail2.ihtfp.org> <AANLkTimhAVgk0Ubc=Yw2mjT0TzurF_BjQ7f5RNKVRR_i@mail.gmail.com>

Jarod,

On Tue, September 28, 2010 10:10 pm, Jarod Wilson wrote:
> On Tue, Sep 28, 2010 at 10:02 PM, Derek Atkins <derek-CrUh67yIh4IAvxtiuMwx3w at public.gmane.org> wrote:
>> Ok,
>>
>> On Tue, September 28, 2010 9:47 pm, Jarod Wilson wrote:
>>> On Tue, Sep 28, 2010 at 9:48 AM, Derek Atkins <warlord-3s7WtUTddSA at public.gmane.org> wrote:
>>>> I noticed the following in my mcelog, and I was hoping someone could
>>>> help me decode this. My google fu has not led me to an answer.
>>>>
>>>> I'm running a Supermicro H8DA3-2 with two Quad-Core AMD Opteron(tm)
>>>> Processor 2378 and 16GB of RAM (8 sticks of ACTICA DDR2 667 2GB ECC
>>>> REG) purchased with the machine in Jan, 2009.
>>>>
>>>> Is this a memory issue?
>>>
>>> At first glance, it looks to be a bad CPU L3 cache, but hard to say for
>>> sure...
[snip]
>> tail -6 /var/log/mcelog | mcelog --k8 --ascii
>> MCE 0
>> HARDWARE ERROR. This is *NOT* a software problem!
>> Please contact your hardware vendor
>> MISC c008000001000000 ADDR 234909fc0
>> STATUS 9c524484001d011b MCGSTATUS 0
>> HARDWARE ERROR. This is *NOT* a software problem!
>> Please contact your hardware vendor
>> CPU 0 0 data cache MISC c008000001000000 ADDR 234909fc0
>>  Data cache ECC error (syndrome a4)
>>       bit34 = err cpu2
>>       bit42 = L3 subcache in error bit 0
>>       bit46 = corrected ecc error
>>       bit59 = misc error valid
>>  memory/cache error 'generic read mem transaction, generic transaction,
>> level generic'
>> STATUS 9c524484001d011b MCGSTATUS 0
>> (Fields were incomplete)
>>
>> So what does this mean?
>
> Well, mcelog seems to think you have a bad CPU, but I'd have to talk
> to some of the hardware folks at work to get a better idea exactly
> what's up. Seems possible it's just an ECC memory error too though, and
> one that was corrected. Do you have any edac modules loaded? Not sure
> if that box needs edac_amd64 or something else, and/or when exactly it
> was that edac_amd64 finally got merged upstream (and therefore into
> the Fedora kernels). Yeah, lemme (try to remember to) poke some folks
> who actually work on this code and know the hardware better
> tomorrow...

Thanks.  It *is* ECC memory.  I'd much rather replace my 2yo ECC RAM than
replace my CPU.  In either case it's annoying.

I temporarily downgraded from 2.6.34.6-54 to 2.6.27.41-170.2.117 in order
to keep my VMs from dying and this seems to be helping.  Running 2.6.34
I'd have periodic cases where VMs would spin, md_raid would spin, and the
network would drop to all my VMs, and sometimes the VMs would report
ATA/SCSI disk errors.  I didn't have any of those issues prior to
upgrading, and haven't had them since rebooting into the older kernel.

But I suspect there's still really a hardware problem somewhere.  :(

> Jarod Wilson
> jarod-ajLrJawYSntWk0Htik3J/w at public.gmane.org

-derek
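
For readers following along, here is a minimal sketch of how the STATUS
value in the mcelog output above breaks down into bit fields. It is not
mcelog's own code. The top flag bits (VAL, OVER, UC, EN, MISCV, ADDRV, PCC)
and the low 16-bit memory-hierarchy error code follow the general x86
machine-check layout; bit 46 is taken from mcelog's own "bit46 = corrected
ecc error" line. The syndrome position (bits 54:47) is an assumption about
the K8-style layout, and the function and dictionary names are purely
illustrative.

```python
# Sketch: split an x86 MCi_STATUS value into the fields mcelog reports.
STATUS = 0x9c524484001d011b  # copied from the mcelog output in this thread

# Flag bits at the top of MCi_STATUS.
FLAGS = {
    63: "VAL (error valid)",
    62: "OVER (error overflow)",
    61: "UC (uncorrected error)",
    60: "EN (error reporting enabled)",
    59: "MISCV (MISC register valid)",
    58: "ADDRV (ADDR register valid)",
    57: "PCC (processor context corrupt)",
}

# Sub-fields of a memory-hierarchy error code: 0000 0001 RRRR TTLL.
LEVEL = {0: "L0", 1: "L1", 2: "L2", 3: "generic"}
TRANSACTION = {0: "instruction", 1: "data", 2: "generic"}
REQUEST = {0: "generic error", 1: "generic read", 2: "generic write",
           3: "data read", 4: "data write", 5: "instruction fetch",
           6: "prefetch", 7: "evict", 8: "snoop"}


def bit(value, n):
    """Return bit n of value as 0 or 1."""
    return (value >> n) & 1


def decode_status(status):
    """Return a dict of decoded fields for one MCi_STATUS value."""
    code = status & 0xFFFF
    result = {
        "flags": [name for n, name in FLAGS.items() if bit(status, n)],
        "uncorrected": bool(bit(status, 61)),
        # Bit 46 per mcelog's "bit46 = corrected ecc error" line.
        "corrected_ecc": bool(bit(status, 46)),
        # ASSUMPTION: ECC syndrome lives in bits 54:47 (K8-style layout).
        "syndrome": (status >> 47) & 0xFF,
        "error_code": code,
    }
    # Only interpret RRRR/TT/LL when the code has the memory-hierarchy form.
    if (code >> 8) == 0x01:
        result["request"] = REQUEST.get((code >> 4) & 0xF, "unknown")
        result["transaction"] = TRANSACTION.get((code >> 2) & 0x3, "unknown")
        result["level"] = LEVEL[code & 0x3]
    return result


if __name__ == "__main__":
    for key, value in decode_status(STATUS).items():
        print(key, "=", hex(value) if key in ("syndrome", "error_code") else value)
```

With the value from this thread, the UC bit is clear and bit 46 is set,
which is consistent with mcelog describing a corrected ECC error, and the
low error code decodes to a generic read at generic level, matching the
'generic read mem transaction' line. None of this says which part (CPU
cache or DIMM) is at fault.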
