decoding MCE Logs? Possible hardware issue?
Jerry Feldman
gaf-mNDKBlG2WHs at public.gmane.org
Wed Sep 29 11:09:23 EDT 2010
On 09/29/2010 10:44 AM, Derek Atkins wrote:
> On Wed, September 29, 2010 10:10 am, Jerry Feldman wrote:
>
>> On 09/29/2010 09:29 AM, Derek Atkins wrote:
>>
>>> Jerry Feldman <gaf-mNDKBlG2WHs at public.gmane.org> writes:
>>>
>>>
>>>
>>>>> But I suspect there's still really a hardware problem somewhere. :(
>>>>>
>>>>>
>>>>>
>>>>>
>>>> Just one thing to add. I have a number of servers with Supermicro
>>>> boards, and one of them won't boot unless I blacklist one of the edac
>>>> modules. That system has 64GB ECC memory and either 1 or 2 Intel Xeon
>>>> CPUs (One of my systems only has 1 CPU the rest have 2). If you are
>>>> interested I can email you with the modules I am blacklisting.
>>>>
>>>>
>>> Note that this is a Supermicro with AMD CPUs. It only has 16GB RAM
>>> right now, but I might extend that if I find that some of the RAM is
>>> bad. The system boots just fine, and I do not have any edac modules
>>> loaded at all (according to lsmod). So I'm not sure what blacklisting
>>> it would accomplish?
>>>
>>> -derek
>>>
>>>
>>>
>> I've got 5 systems with Supermicro X7DB8+ Mother Boards, and only one
>> has problems with the edac modules. In my search for a solution to the
>> udev hang problem I found a lot of pointers to Supermicro boards. I
>> don't know why that one has the issue. It certainly is a much different
>> issue than you have. My lsmod on another system shows:
>> [gaf@boslc05 ~]$ lsmod | grep edac
>> i5000_edac 42177 0
>> edac_mc 60193 1 i5000_edac
>>
>> In any case I was just trying to provide some additional information.
>>
> Interesting! I wonder if this is an Intel v. AMD thing? Or perhaps a
> 2.6.27 v. 2.6.34 thing? Or maybe it's an X7DB8+ v. H8DA3-2 thing?
>
> I'd turn off edac if it looked like it was actually loading on my system.
>
> Ahh, the joys of ECC RAM -- harder to tell when the RAM is bad. ;)
>
> -derek
>
>
>
>
2.6.18-92.el5 on 4 of the systems, and 2.6.18-128.el5 on the system
blacklisted, but I believe it had the same issue when it ran RHEL 5.2.
But, judging from the logs and Jarod's emails, it appears to be L3
cache. My systems are all Intel whitebox systems we got free.
-- 
Jerry Feldman <gaf-mNDKBlG2WHs at public.gmane.org>
Boston Linux and Unix
PGP key id: 537C5846
PGP Key fingerprint: 3D1B 8377 A3C0 A5F2 ECBB CA3B 4607 4319 537C 5846