BLU Discuss list archive

Subject: random system freeze
From: kristian.hermansen-Re5JQEeQqe8AvxtiuMwx3w at public.gmane.org (Kristian Hermansen)
Date: Sun, 8 Jul 2007 22:52:24 -0400
In-reply-to: <46918785.3030207-yrHdaQSNc4gdnm+yROfE0A@public.gmane.org>
References: <46911517.5070908@newmag.org> <fe37588d0707080955p761d50f3na329eaf8b2948943@mail.gmail.com> <46918785.3030207@newmag.org>

Check out this article on debugging kernel issues:
http://resource.intel.com/telecom/support/tnotes/tnbyos/2000/tn062.htm

But the easiest way I have found is to leave a virtual terminal up
overnight (blanking off, no screensaver, etc.).  Try to make the system
reproduce the issue by whatever means necessary, but keep that console
up.  If you have a serial console, that's the way to go, but not
everyone does.  When the kernel does "aieee!" (hard panic), you will
see the output on the virtual terminal, which will give you a hint.  I
have used this technique to pin down numerous ndiswrapper issues on
AMD64 using a 64-bit Windows driver.  It works for any code that causes
the kernel to dump, so give it a try...

<snip>
  How to Troubleshoot a Hard Kernel Panic
Hard Panics -- Symptoms:

   1. Machine is completely locked up and unusable.
   2. Num Lock / Caps Lock / Scroll Lock keys usually blink.
   3. If in console mode, dump is displayed on monitor (including the
phrase "Aieee!").
   4. Similar to Windows Blue Screen.

Hard panics -- causes:
The most common cause of a hard kernel panic is when a driver crashes
within an interrupt handler, usually because it tried to access a null
pointer within the interrupt handler.  When this happens, that driver
cannot handle any new interrupts and eventually the system crashes.
This is not exclusive to Dialogic drivers.

Hard panics -- information to collect:
Depending on the nature of the panic, the kernel will log all
information it can prior to locking up.  Since a kernel panic is a
drastic failure, it is uncertain how much information will be logged.
Below are key pieces of information to collect.  It is important to
collect as many of these as possible, but there is no guarantee that
all of them will be available, especially the first time a panic is
seen.

   1. /var/log/messages -- sometimes the entire kernel panic stack
trace will be logged there
   2. Application / Library logs (RTF, cheetah, etc.) -- may show what
was happening before the panic
   3. Other information about what happened just prior to the panic,
or how to reproduce
   4. Screen dump from console. Since the OS is locked, you cannot cut
and paste from the screen. There are two common ways to get this info:
          * Digital Picture of screen (preferred, since it's quicker and easier)
          * Copying screen with pen and paper or typing to another computer
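
As a rough aid for item 1 in the list above, here is a minimal Python
sketch that scans a saved copy of the messages log for lines that look
like part of a panic.  The marker strings are taken from the example
dumps in this post; the function names are hypothetical, and other
kernel versions may word these messages differently, so treat it as a
starting point rather than a complete detector.

```python
# Marker strings copied from the example dumps in this post. Other kernels
# may phrase these differently, so extend the tuple as needed.
PANIC_MARKERS = (
    "Unable to handle kernel",
    "Oops:",
    "Kernel panic",
)


def find_panic_lines(lines):
    """Return (line_number, text) for every line containing a panic marker."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if any(marker in line for marker in PANIC_MARKERS):
            hits.append((number, line.rstrip("\n")))
    return hits


def scan_log(path):
    """Scan a saved log file; undecodable bytes are replaced, not fatal."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return find_panic_lines(handle)


# Illustrative lines assembled from the examples in this post, not a real log.
sample = [
    "Unable to handle kernel NULL pointer dereference at virtual address 0000000c\n",
    "CPU:    1\n",
    "<0> Kernel panic: Aiee, killing interrupt handler!\n",
]
for number, text in find_panic_lines(sample):
    print(number, text)
```

To check a real file, call scan_log() with the path to your copy of the
log and look at the lines around each hit for the rest of the trace.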

If the dump is not available either in /var/log/messages or on the
screen, follow these tips to get a dump:

   1. If in GUI mode, switch to full console mode -- no dump info is
passed to the GUI (not even to GUI shell).
   2. Make sure screen stays on during full test run -- if a screen
saver kicks in, the screen won't return after a kernel panic. Use
these settings to ensure the screen stays on.
          * setterm -blank 0
          * setterm -powerdown 0
          * setvesablank off
   3. From console, copy dump from screen (see above).

Hard panics -- Troubleshooting when a full trace is available
The stack trace is the most important piece of information to use in
troubleshooting a kernel panic.  It is often crucial to have a full
stack trace, something that may not be available if only a screen dump
is provided -- the top of the stack may scroll off the screen, leaving
only a partial stack trace.  If a full trace is available, it is
usually sufficient to isolate root cause.  To identify whether or not
you have a large enough stack trace, look for a line with EIP, which
will show what function call and module caused the panic.  In the
example below, this is shown in the following line:
          EIP is at _dlgn_setevmask [streams-dlgnDriver] 0xe

If the culprit is a Dialogic driver you will see a module name with:
          streams-xxxxDriver (xxxx = dlgn, dvbm, mercd, etc.)
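
The EIP check described above can be sketched in a few lines of Python.
This is only an illustration: the regular expressions are written
against the line format shown in this post ("EIP is at <function>
[<module>] <offset>"), the function names are hypothetical, and
accepting an underscore as well as a hyphen in the module name is an
assumption based on the two spellings that appear in the example traces.

```python
import re

# Matches a line such as:
#   EIP is at _dlgn_setevmask [streams-dlgnDriver] 0xe
EIP_LINE = re.compile(
    r"EIP is at\s+(?P<function>\S+)\s+"
    r"\[(?P<module>[^\]]+)\]\s+(?P<offset>0x[0-9a-fA-F]+)"
)

# streams-xxxxDriver (xxxx = dlgn, dvbm, mercd, etc.); the underscore
# variant is an assumption taken from the partial trace example.
DIALOGIC_MODULE = re.compile(r"^streams[-_]\w+Driver$")


def find_eip(trace_text):
    """Return function, module and offset from the EIP line, or None."""
    match = EIP_LINE.search(trace_text)
    if match is None:
        return None  # no EIP line: the trace is probably only partial
    return match.groupdict()


def is_dialogic_module(module):
    """True if the module name follows the streams-xxxxDriver pattern."""
    return DIALOGIC_MODULE.match(module) is not None


info = find_eip("EIP is at _dlgn_setevmask [streams-dlgnDriver] 0xe")
if info is None:
    print("no EIP line found")
else:
    print(info["function"], info["module"], is_dialogic_module(info["module"]))
```

A result of None from find_eip() is the cue to treat the dump as a
partial trace and collect more supporting information.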

Hard panic -- full trace example:
Unable to handle kernel NULL pointer dereference at virtual address 0000000c
printing eip:
f89e568a
*pde = 32859001
*pte = 00000000
Oops: 0000
Kernel 2.4.9-31enterprise
CPU:    1
EIP:    0010:[<f89e568a>]    Tainted: PF
EFLAGS: 00010096
EIP is at _dlgn_setevmask [streams-dlgnDriver] 0xe
eax: 00000000   ebx: f65f5410   ecx: f5e16710   edx: f65f5410
esi: 00001ea0   edi: f5e23c30   ebp: f65f5410   esp: f1cf7e78
ds: 0018   es: 0018   ss: 0018
Process pwcallmgr (pid: 10334, stackpage=f1cf7000)
Stack: 00000000 c01067fa 00000086 f1cf7ec0 00001ea0 f5e23c30 f65f5410 f89e53ec
f89fcd60 f5e16710 f65f5410 f65f5410 f8a54420 f1cf7ec0 f8a4d73a 0000139e
f5e16710 f89fcd60 00000086 f5e16710 f5e16754 f65f5410 0000034a f894e648
Call Trace: [setup_sigcontext+218/288] setup_sigcontext [kernel] 0xda
Call Trace: [<c01067fa>] setup_sigcontext [kernel] 0xda
[<f89e53ec>] dlgnwput [streams-dlgnDriver] 0xe8
[<f89fcd60>] Sm_Handle [streams-dlgnDriver] 0x1ea0
[<f8a54420>] intdrv_lock [streams-dlgnDriver] 0x0
[<f8a4d73a>] Gn_Maxpm [streams-dlgnDriver] 0x8ba
[<f89fcd60>] Sm_Handle [streams-dlgnDriver] 0x1ea0
[<f894e648>] lis_safe_putnext [streams] 0x168
[<f8a7b098>] __insmod_streams-dvbmDriver_S.bss_L117376
[streams-dvbmDriver] 0xab8 [<f8a78821>] dvbmwput [streams-dvbmDriver]
0x6f5
[<f8a79f98>] dvwinit [streams-dvbmDriver] 0x2c0
[<f894e648>] lis_safe_putnext [streams] 0x168
[<f893e6d8>] lis_strputpmsg [streams] 0x54c
[<f895482e>] __insmod_streams_S.rodata_L35552 [streams] 0x182e
[<f8951227>] sys_putpmsg [streams] 0x6f
[system_call+51/56] system_call [kernel] 0x33
[<c010719b>] system_call [kernel] 0x33
Nov 28 12:17:58 talus kernel:
Nov 28 12:17:58 talus kernel:
Code: 8b 70 0c 8b 06 83 f8 20 8b 54 24 20 8b 6c 24 24 76 1c 89 5c

Hard panics -- Troubleshooting when a full trace is not available
If only a partial stack trace is available, it can be tricky to
isolate the root cause, since there is no explicit information about
what module or function call caused the panic.  Instead, only commands
leading up to the final command will be seen in a partial stack trace.
In this case, it is very important to collect as much information as
possible about what happened leading up to the kernel panic
(application logs, library traces, steps to reproduce, etc).

Hard panic -- partial trace example (note there is no line with EIP information)
[<c01e42e7>] ip_rcv [kernel] 0x357
[<f8a179d5>] sramintr [streams_dlgnDriver] 0x32d
[<f89a3999>] lis_spin_lock_irqsave_fcn [streams] 0x7d
[<f8a82fdc>] inthw_lock [streams_dlgnDriver] 0x1c
[<f8a7bad8>] pwswtbl [streams_dlgnDriver] 0x0
[<f8a15442>] dlgnintr [streams_dlgnDriver] 0x4b
[<f8a7c30a>] Gn_Maxpm [streams_dlgnDriver] 0x7ae
[<c0123bc1>] __run_timers [kernel] 0xd1
[<c0108a6e>] handle_IRQ_event [kernel] 0x5e
[<c0108c74>] do_IRQ [kernel] 0xa4
[<c0105410>] default_idle [kernel] 0x0
[<c0105410>] default_idle [kernel] 0x0
[<c022fab0>] call_do_IRQ [kernel] 0x5
[<c0105410>] default_idle [kernel] 0x0
[<c0105410>] default_idle [kernel] 0x0
[<c010543d>] default_idle [kernel] 0x2d
[<c01054c2>] cpu_idle [kernel] 0x2d
[<c011bb86>] __call_console_drivers [kernel] 0x4b
[<c011bcfb>] call_console_drivers [kernel] 0xeb
Code: 8b 50 0c 85 d2 74 31 f6 42 0a 02 74 04 89 44 24 08 31 f6 0f
<0> Kernel panic: Aiee, killing interrupt handler!
In interrupt handler - not syncing
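
With a partial trace like the one above there is no EIP line to point
at a culprit, but tallying which modules show up in the surviving
frames can at least suggest where to look first.  Below is a minimal
Python sketch of that idea.  The frame pattern is written against the
"[<address>] symbol [module] offset" lines shown in this post, the
function names are hypothetical, and a module appearing often is only a
hint, not proof that it caused the panic.

```python
import re
from collections import Counter

# Matches a frame such as:
#   [<f8a179d5>] sramintr [streams_dlgnDriver] 0x32d
FRAME = re.compile(
    r"\[<(?P<address>[0-9a-fA-F]+)>\]\s+(?P<symbol>\S+)\s+"
    r"\[(?P<module>[^\]]+)\]\s+(?P<offset>0x[0-9a-fA-F]+)"
)


def parse_frames(trace_text):
    """Return one dict (address, symbol, module, offset) per frame found."""
    return [match.groupdict() for match in FRAME.finditer(trace_text)]


def modules_by_frequency(trace_text):
    """Return (module, frame_count) pairs, most frequent module first."""
    counts = Counter(frame["module"] for frame in parse_frames(trace_text))
    return counts.most_common()


# First four frames of the partial trace example in this post.
sample = """\
[<c01e42e7>] ip_rcv [kernel] 0x357
[<f8a179d5>] sramintr [streams_dlgnDriver] 0x32d
[<f89a3999>] lis_spin_lock_irqsave_fcn [streams] 0x7d
[<f8a82fdc>] inthw_lock [streams_dlgnDriver] 0x1c
"""
for module, count in modules_by_frequency(sample):
    print(module, count)
```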

Hard panics -- using kernel debugger (KDB)
If only a partial trace is available and the supporting information is
not sufficient to isolate root cause, it may be useful to use KDB.
KDB is a tool that is compiled into the kernel that causes the kernel
to break into a shell rather than lock up when a panic occurs.  This
enables you to collect additional information about the panic, which
is often useful in determining root cause.

Some important things to note about using KDB:

   1. If this is a potential Dialogic issue, technical support should
be contacted prior to the use of KDB
   2. Must use base kernel -- i.e. 2.4.18 kernel instead of 2.4.18-5
from RedHat. This is because KDB is only available for the base
kernels, and not the builds created by RedHat. While this does create
a slight deviation from the original configuration, it usually does
not interfere with root cause analysis.
   3. Need different Dialogic drivers compiled to handle the specific kernel.
</snip>
-- 
Kristian Hermansen


Boston Linux & Unix / webmaster@blu.org