decoding MCE Logs? Possible hardware issue?

Jerry Feldman gaf-mNDKBlG2WHs at public.gmane.org
Wed Sep 29 11:09:23 EDT 2010


On 09/29/2010 10:44 AM, Derek Atkins wrote:
> On Wed, September 29, 2010 10:10 am, Jerry Feldman wrote:
>
>> On 09/29/2010 09:29 AM, Derek Atkins wrote:
>>
>>> Jerry Feldman <gaf-mNDKBlG2WHs at public.gmane.org> writes:
>>>
>>>
>>>
>>>>> But I suspect there's still really a hardware problem somewhere.  :(
>>>>>
>>>>>
>>>>>
>>>>>
>>>> Just one thing to add. I have a number of servers with Supermicro
>>>> boards, and one of them won't boot unless I blacklist one of the edac
>>>> modules. That system has 64GB ECC memory and either 1 or 2 Intel Xeon
>>>> CPUs (One of my systems only has 1 CPU the rest have 2).  If you are
>>>> interested I can email you with the modules I am blacklisting.
>>>>
>>>>
>>> Note that this is a Supermicro with AMD CPUs.  It only has 16GB RAM
>>> right now, but I might extend that if I find that some of the RAM is
>>> bad.  The system boots just fine, and I do not have any edac modules
>>> loaded at all (according to lsmod).  So I'm not sure what blacklisting
>>> it would accomplish?
>>>
>>> -derek
>>>
>>>
>>>
>> I've got 5 systems with Supermicro X7DB8+ Mother Boards, and only one
>> has problems with the edac modules. In my search for a solution to the
>> udev hang problem I found a lot of pointers to Supermicro boards. I
>> don't know why that one has the issue. It certainly is a much different
>> issue than you have. My lsmod on another system shows:
>> [gaf at boslc05 ~]$ lsmod | grep edac
>> i5000_edac             42177  0
>> edac_mc                60193  1 i5000_edac
>>
>> In any case I was just trying to provide some additional information.
>>
> Interesting!  I wonder if this is an Intel v. AMD thing?  Or perhaps a
> 2.6.27 v. 2.6.34 thing?  Or maybe it's an X7DB8+ v. H8DA3-2 thing?
>
> I'd turn off edac if it looked like it was actually loading on my system.
>
> Ahh, the joys of ECC RAM -- harder to tell when the RAM is bad.  ;)
>
> -derek
>
>
>
>
I'm running 2.6.18-92.el5 on 4 of the systems, and 2.6.18-128.el5 on the
system with the blacklisted module, but I believe it had the same issue
when it ran RHEL 5.2. But, judging by the logs and Jarod's emails, it
appears to be the L3 cache. My systems are all Intel whitebox systems we
got free.


-- 
Jerry Feldman <gaf-mNDKBlG2WHs at public.gmane.org>
Boston Linux and Unix
PGP key id: 537C5846
PGP Key fingerprint: 3D1B 8377 A3C0 A5F2 ECBB  CA3B 4607 4319 537C 5846
