BLU Discuss list archive

decoding MCE Logs? Possible hardware issue?

Subject: decoding MCE Logs? Possible hardware issue?
From: gaf-mNDKBlG2WHs at public.gmane.org (Jerry Feldman)
Date: Wed, 29 Sep 2010 07:39:04 -0400
In-reply-to: <76f7dd4322eb972c4ade03fba6ba6150.squirrel-4rlZbEtjZ/ZxQRyTOB7HMw@public.gmane.org>
References: <sjmr5gex9kk.fsf@pgpdev.ihtfp.org> <AANLkTikWfuisrVX2yH+M1gQc=zT6NWC712tNuoeeiW1O@mail.gmail.com> <4f8d21e0d7cb2ecf7fcb96a315d4a6de.squirrel@mail2.ihtfp.org> <AANLkTimhAVgk0Ubc=Yw2mjT0TzurF_BjQ7f5RNKVRR_i@mail.gmail.com> <76f7dd4322eb972c4ade03fba6ba6150.squirrel@mail2.ihtfp.org>

On 09/28/2010 10:17 PM, Derek Atkins wrote:
> Jarod,
>
> On Tue, September 28, 2010 10:10 pm, Jarod Wilson wrote:
>
>> On Tue, Sep 28, 2010 at 10:02 PM, Derek Atkins <derek-CrUh67yIh4IAvxtiuMwx3w at public.gmane.org> wrote:
>>
>>> Ok,
>>>
>>> On Tue, September 28, 2010 9:47 pm, Jarod Wilson wrote:
>>>
>>>> On Tue, Sep 28, 2010 at 9:48 AM, Derek Atkins <warlord-3s7WtUTddSA at public.gmane.org> wrote:
>>>>
>>>>> I noticed the following in my mcelog, and I was hoping someone could
>>>>> help me decode this.  My google fu has not led me to an answer.
>>>>>
>>>>> I'm running a Supermicro H8DA3-2 with two Quad-Core AMD Opteron(tm)
>>>>> Processor 2378 and 16GB of RAM (8 sticks of ACTICA DDR2 667 2GB ECC
>>>>> REG) purchased with the machine in Jan, 2009.
>>>>>
>>>>> Is this a memory issue?
>>>>>
>>>> At first glance, it looks to be a bad CPU L3 cache, but hard to say for
>>>> sure...
>>>>
> [snip]
>
>>> tail -6 /srv/log/mcelog  | mcelog --k8 --ascii
>>> MCE 0
>>> HARDWARE ERROR. This is *NOT* a software problem!
>>> Please contact your hardware vendor
>>> MISC c008000001000000 ADDR 234909fc0
>>> STATUS 9c524484001d011b MCGSTATUS 0
>>> HARDWARE ERROR. This is *NOT* a software problem!
>>> Please contact your hardware vendor
>>> CPU 0 0 data cache MISC c008000001000000 ADDR 234909fc0
>>>  Data cache ECC error (syndrome a4)
>>>       bit34 = err cpu2
>>>       bit42 = L3 subcache in error bit 0
>>>       bit46 = corrected ecc error
>>>       bit59 = misc error valid
>>>  memory/cache error 'generic read mem transaction, generic transaction,
>>> level generic'
>>> STATUS 9c524484001d011b MCGSTATUS 0
>>> (Fields were incomplete)
>>>
>>> So what does this mean?
>>>
>> Well, mcelog seems to think you have a bad CPU, but I'd have to talk
>> to some of the hardware folks at work to get a better idea exactly
>> what's up. Seems possible it's just an ECC memory error too though, and
>> one that was corrected. Do you have any edac modules loaded? Not sure
>> if that box needs edac_amd64 or something else, and/or when exactly it
>> was that edac_amd64 finally got merged upstream (and therefore into
>> the Fedora kernels). Yeah, lemme (try to remember to) poke some folks
>> who actually work on this code and know the hardware better
>> tomorrow...
>>
> Thanks.  It *is* ECC memory.  I'd much rather replace my 2yo ECC RAM than
> replace my CPU.  In either case it's annoying.
>
> I temporarily downgraded from 2.6.34.6-54 to 2.6.27.41-170.2.117 in order
> to keep my VMs from dying and this seems to be helping.  Running 2.6.34
> I'd have periodic cases where VMs would spin, md_raid would spin, and the
> network would drop to all my VMs, and sometimes the VMs would report
> ATA/SCSI disk errors.  I didn't have any of those issues prior to
> upgrading, and haven't had them since rebooting into the older kernel.
>
> But I suspect there's still really a hardware problem somewhere.  :(
>
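
For anyone following along, the STATUS value in the quoted mcelog output can
be unpacked by hand. The following is only a rough Python sketch, not mcelog's
own logic. It assumes the common MCi_STATUS flag layout (VAL at bit 63 down
to PCC at bit 57), the AMD-style CECC/UECC bits at 46 and 45, and an ECC
syndrome in bits 54:47, which is consistent with the "syndrome a4", bit46 and
bit59 lines in the output. The function and table names are made up for
illustration.

```python
# Minimal sketch: unpack a machine-check STATUS word by hand.
# Bit positions are assumptions based on the common MCi_STATUS layout
# (AMD K8-style CECC/UECC and syndrome fields); check the processor
# documentation before relying on them.

STATUS = 0x9C524484001D011B  # STATUS value from the quoted mcelog output

# bit position -> short flag name (illustrative names)
FLAG_BITS = {
    63: "VAL",    # register contents are valid
    62: "OVER",   # a previous error was overwritten
    61: "UC",     # error was not corrected
    60: "EN",     # error reporting was enabled
    59: "MISCV",  # MISC register holds valid data
    58: "ADDRV",  # ADDR register holds a valid address
    57: "PCC",    # processor context may be corrupt
    46: "CECC",   # corrected ECC error
    45: "UECC",   # uncorrected ECC error
}

# Sub-fields of a memory/cache error code of the form 0000 0001 RRRR TTLL
MEM_TRANSACTION = {0: "generic", 1: "generic read", 2: "generic write",
                   3: "data read", 4: "data write", 5: "instruction fetch",
                   6: "prefetch", 7: "evict", 8: "snoop"}
TRANSACTION_TYPE = {0: "instruction", 1: "data", 2: "generic"}
CACHE_LEVEL = {0: "L0", 1: "L1", 2: "L2", 3: "generic"}


def decode_status(status):
    """Return the set flags, ECC syndrome, and error-code fields of a STATUS word."""
    flags = [name for bit, name in sorted(FLAG_BITS.items(), reverse=True)
             if (status >> bit) & 1]
    error_code = status & 0xFFFF
    result = {
        "flags": flags,
        "syndrome": (status >> 47) & 0xFF,  # assumed to live in bits 54:47
        "error_code": error_code,
    }
    if error_code >> 8 == 0x01:  # memory/cache error format
        result["memory_error"] = (
            MEM_TRANSACTION.get((error_code >> 4) & 0xF, "unknown"),
            TRANSACTION_TYPE.get((error_code >> 2) & 0x3, "unknown"),
            CACHE_LEVEL.get(error_code & 0x3, "unknown"),
        )
    return result


if __name__ == "__main__":
    decoded = decode_status(STATUS)
    print("flags:   ", ", ".join(decoded["flags"]))
    print("syndrome:", format(decoded["syndrome"], "x"))
    print("error:   ", decoded.get("memory_error"))
```

Under those assumptions the UC and PCC bits are clear and CECC is set, which
matches mcelog's "corrected ecc error" line. The sketch does not try to
interpret the model-specific bits (bit34, bit42) that mcelog also printed.
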
Just one thing to add. I have a number of servers with Supermicro
boards, and one of them won't boot unless I blacklist one of the edac
modules. That system has 64GB of ECC memory and either 1 or 2 Intel Xeon
CPUs (one of my systems has only 1 CPU; the rest have 2).  If you are
interested, I can email you the list of modules I am blacklisting.
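
To see which edac modules a box currently has loaded before deciding whether
any need blacklisting, something like the following Python sketch might help.
It assumes the usual /proc/modules format, where the module name is the first
whitespace-separated field on each line; the helper name is hypothetical.

```python
# Minimal sketch: list loaded kernel modules whose name mentions "edac".
# Assumes the usual /proc/modules layout (module name is the first field).

def edac_modules(proc_modules_text):
    """Return the names of modules containing 'edac' from /proc/modules text."""
    names = []
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if fields and "edac" in fields[0].lower():
            names.append(fields[0])
    return names


if __name__ == "__main__":
    try:
        with open("/proc/modules") as fh:
            loaded = edac_modules(fh.read())
    except OSError:
        loaded = []  # not Linux, or /proc is unavailable
    print("\n".join(loaded) if loaded else "no edac modules loaded")
```

A module that turns out to cause trouble is normally kept from loading with a
`blacklist` line in a file under /etc/modprobe.d/; which module that would be
depends on the board and is not something this sketch can tell you.
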

-- 
Jerry Feldman <gaf-mNDKBlG2WHs at public.gmane.org>
Boston Linux and Unix
PGP key id: 537C5846
PGP Key fingerprint: 3D1B 8377 A3C0 A5F2 ECBB  CA3B 4607 4319 537C 5846

Boston Linux & Unix / webmaster@blu.org