On 09/28/2010 10:17 PM, Derek Atkins wrote:
> Jarod,
>
> On Tue, September 28, 2010 10:10 pm, Jarod Wilson wrote:
>> On Tue, Sep 28, 2010 at 10:02 PM, Derek Atkins <derek-CrUh67yIh4IAvxtiuMwx3w at public.gmane.org> wrote:
>>> Ok,
>>>
>>> On Tue, September 28, 2010 9:47 pm, Jarod Wilson wrote:
>>>> On Tue, Sep 28, 2010 at 9:48 AM, Derek Atkins <warlord-3s7WtUTddSA at public.gmane.org> wrote:
>>>>> I noticed the following in my mcelog, and I was hoping someone could
>>>>> help me decode this. My google fu has not led me to an answer.
>>>>>
>>>>> I'm running a Supermicro H8DA3-2 with two Quad-Core AMD Opteron(tm)
>>>>> Processor 2378 and 16GB of RAM (8 sticks of ACTICA DDR2 667 2GB ECC
>>>>> REG)
>>>>> purchased with the machine in Jan, 2009.
>>>>>
>>>>> Is this a memory issue?
>>>>>
>>>> At first glance, it looks to be a bad cpu l3 cache, but hard to say for
>>>> sure...
>>>>
> [snip]
>
>>> tail -6 /var/log/mcelog | mcelog --k8 --ascii
>>> MCE 0
>>> HARDWARE ERROR. This is *NOT* a software problem!
>>> Please contact your hardware vendor
>>> MISC c008000001000000 ADDR 234909fc0
>>> STATUS 9c524484001d011b MCGSTATUS 0
>>> HARDWARE ERROR. This is *NOT* a software problem!
>>> Please contact your hardware vendor
>>> CPU 0 0 data cache MISC c008000001000000 ADDR 234909fc0
>>> Data cache ECC error (syndrome a4)
>>> bit34 = err cpu2
>>> bit42 = L3 subcache in error bit 0
>>> bit46 = corrected ecc error
>>> bit59 = misc error valid
>>> memory/cache error 'generic read mem transaction, generic transaction,
>>> level generic'
>>> STATUS 9c524484001d011b MCGSTATUS 0
>>> (Fields were incomplete)
>>>
>>> So what does this mean?
>>>
>> Well, mcelog seems to think you have a bad CPU, but I'd have to talk
>> to some of the hardware folks at work to get a better idea exactly
>> what's up. Seems possible it's just an ecc memory error too though, and
>> one that was corrected. Do you have any edac modules loaded? Not sure
>> if that box needs edac_amd64 or something else, and/or when exactly it
>> was that edac_amd64 finally got merged upstream (and therefore into
>> the Fedora kernels). Yeah, lemme (try to remember to) poke some folks
>> who actually work on this code and know the hardware better
>> tomorrow...
>>
> Thanks. It *is* ECC memory. I'd much rather replace my 2yo ECC RAM than
> replace my CPU. In either case it's annoying.
>
> I temporarily downgraded from 2.6.34.6-54 to 2.6.27.41-170.2.117 in order
> to keep my VMs from dying and this seems to be helping. Running 2.6.34
> I'd have periodic cases where VMs would spin, md_raid would spin, and the
> network would drop to all my VMs, and sometimes the VMs would report
> ATA/SCSI disk errors. I didn't have any of those issues prior to
> upgrading, and haven't had them since rebooting into the older kernel.
>
> But I suspect there's still really a hardware problem somewhere. :(
>

Just one thing to add. I have a number of servers with Supermicro boards,
and one of them won't boot unless I blacklist one of the edac modules.
That system has 64GB ECC memory and either 1 or 2 Intel Xeon CPUs (one of
my systems has only 1 CPU; the rest have 2). If you are interested I can
email you with the modules I am blacklisting.

--
Jerry Feldman <gaf-mNDKBlG2WHs at public.gmane.org>
Boston Linux and Unix
PGP key id: 537C5846
PGP Key fingerprint: 3D1B 8377 A3C0 A5F2 ECBB CA3B 4607 4319 537C 5846
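
For anyone who wants to cross-check the mcelog decode quoted above by hand, here is a rough Python sketch that lists which bits are set in the STATUS value and pulls out the syndrome. The bit labels are copied straight from the mcelog output in this thread. The syndrome position (bits 54:47) is an assumption on my part, not something stated in the thread, though with that assumption the value comes out as the same a4 that mcelog printed. The names `BIT_LABELS`, `set_bits` and `syndrome` are made up for illustration.

```python
# Bit labels copied from the mcelog output quoted in this thread.
BIT_LABELS = {
    34: "err cpu2",
    42: "L3 subcache in error bit 0",
    46: "corrected ecc error",
    59: "misc error valid",
}


def set_bits(value):
    """Return the positions of all bits set in a 64-bit value, lowest first."""
    return [n for n in range(64) if (value >> n) & 1]


def syndrome(status):
    """Extract the ECC syndrome, ASSUMING it occupies bits 54:47 of STATUS."""
    return (status >> 47) & 0xFF


# STATUS value from the mcelog record in the thread.
status = 0x9C524484001D011B

for bit in set_bits(status):
    if bit in BIT_LABELS:
        print(f"bit{bit} = {BIT_LABELS[bit]}")
print(f"syndrome {syndrome(status):x}")
```

This only restates what mcelog already reported; it says nothing about whether the CPU or the RAM is at fault.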
BLU is a member of BostonUserGroups
We also thank MIT for the use of their facilities.