Is there a daemon to kill growing processes?
John Chambers
jc at trillian.mit.edu
Wed May 28 15:42:39 EDT 2003
David Kramer comments:
| I have this problem with my Red Hat 7.3 server, where every now and then it
| gets into this mode where the hard drive is constantly active, the keyboard
| and mouse can take a minute or more to respond to each key press/mouse jerk.
| Sometimes it rights itself after an hour or so, sometimes I have to reboot.
| When in this state, networking works fine: I can ping the box and it will
| masquerade to the internet as it is supposed to, but sshing to it hangs.
|
| I have, on occasion, been able to catch a userland process taking all of the
| memory and cycles when the computer is in this state. On occasion it has
| been the sync software for my PDA.
Yeah, I've found that kpilot can start eating 99.9% of the cpu after
it has finished a sync. I just kill it and the problem is fixed.
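For the "catch it and kill it" case, here is a minimal sketch of the detection half of such a watchdog. It is only an illustration: the function name find_hogs and both thresholds are made up for this example, and it assumes the text comes from `ps -eo pid,pcpu,rss,comm`. It reports offenders and deliberately kills nothing.

```python
def find_hogs(ps_text, cpu_limit=90.0, rss_limit_kb=500000):
    """Return (pid, command, %cpu, rss_kb) for processes over either limit.

    ps_text is assumed to be the output of `ps -eo pid,pcpu,rss,comm`,
    header row included. Both limits are arbitrary example values.
    """
    offenders = []
    for line in ps_text.splitlines()[1:]:      # skip the header row
        parts = line.split(None, 3)            # keep the command in one piece
        if len(parts) < 4:
            continue
        try:
            pid, pcpu, rss = int(parts[0]), float(parts[1]), int(parts[2])
        except ValueError:
            continue                           # not a data row
        # ps reports %CPU averaged over the process lifetime, so a process
        # that only recently went wild may still show a modest figure here.
        if pcpu >= cpu_limit or rss >= rss_limit_kb:
            offenders.append((pid, parts[3].strip(), pcpu, rss))
    return offenders
```

Run from cron or a loop, the result could be logged first; whether to signal a reported pid automatically is a policy decision this sketch leaves open.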
But more generally, I've seen the same sort of behavior on some test
machines for my current project. We have RedHat 7.2, 7.3, 8.0, and
9.0 machines. This is because the client has all these kinds of
machines, and we need to test our software on all of them. The 7.*
machines often show the behavior you describe, but according to ps
and top, there are no processes using more than 1% of the cpu.
What it seems to correlate with is seeing kupdated and/or kjournald
at the top of top's cpu users. We've been trying to discover more
info about this, but haven't learned much.
One of our theories was that it had to do with the journaling file
system. So I dug into that, and found that by "downgrading" the file
systems from ext3 to ext2, the journaling should be disabled. Maybe
it is, but kjournald is still started at boot time (although there
are no instances of "kjournald" in any startup script that we can
find), and it still becomes active at times. When this happens, all
other processes' cpu usage drops to near zero, and everything that
does disk I/O seems to stop responding to input. Sometimes this lasts
for a minute or more.
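Since kjournald runs as a kernel thread rather than being launched from an init script, one thing that may be worth checking is whether any filesystem is still actually mounted as ext3 after the "downgrade". A minimal sketch, assuming the usual /proc/mounts layout (device, mount point, type, options, ...); the function name journaled_mounts is hypothetical:

```python
def journaled_mounts(mounts_text, journaled_types=("ext3",)):
    """Return (device, mount_point) pairs whose filesystem type is journaled.

    mounts_text is assumed to be the contents of /proc/mounts, where the
    third whitespace-separated field of each line is the filesystem type.
    """
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[2] in journaled_types:
            found.append((fields[0], fields[1]))
    return found
```

Calling journaled_mounts(open("/proc/mounts").read()) on one of the test machines would list anything the kernel still treats as ext3. An empty list would not by itself explain why kjournald is present, but a non-empty one would.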
This is especially unnerving considering that the machines are to be
used as web servers. Total response failure for a minute at a time is
a bit of a showstopper for a web server.
We also suspect that the problem may be the RAID software, but we
haven't found any way to test this. Part of our reason for suspecting
this is that "vmstat 1" shows the system flipping between two states.
In the responsive state, bi and bo (blocks in and out) are both
mostly 3 digits. In the unresponsive state, bi drops to zero for the
duration, while bo switches between 0 and 5-digit values. It looks like the
kernel has decided to flush buffers, and has halted all other tasks
while this is done. This doesn't seem like normal kernel behavior. We
haven't found any hooks that can tell us why this happens.
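The two states described above can be picked out of "vmstat 1" output mechanically, which makes it easier to timestamp the stalls and line them up with other logs. A rough sketch: the function name classify_vmstat and the 10000 cutoff (standing in for "5 digits") are assumptions made for this example, and the columns are located by the bi/bo header names because the layout varies between vmstat versions.

```python
def classify_vmstat(lines):
    """Yield (bi, bo, state) for each sample line of `vmstat 1` output.

    state is "suspect" when bi is 0 and bo is either 0 or at least 10000,
    otherwise "responsive". An idle machine also shows bi == 0 and bo == 0,
    so "suspect" is a hint to look closer, not proof of a stall.
    """
    bi_col = bo_col = None
    for line in lines:
        fields = line.split()
        if "bi" in fields and "bo" in fields:
            # Header row: remember where the two columns are.
            bi_col, bo_col = fields.index("bi"), fields.index("bo")
            continue
        if bi_col is None or not fields or not fields[0].isdigit():
            continue                      # banner line or blank line
        if len(fields) <= max(bi_col, bo_col):
            continue                      # truncated sample
        bi, bo = int(fields[bi_col]), int(fields[bo_col])
        suspect = bi == 0 and (bo == 0 or bo >= 10000)
        yield bi, bo, "suspect" if suspect else "responsive"
```

Passing sys.stdin to classify_vmstat with "vmstat 1" piped in would label samples as they arrive; counting consecutive "suspect" samples gives a rough measure of how long each stall lasts.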
We don't seem to get this misbehavior with the 8.0 machines. We may
have to tell the client to stop using RH 7.* and move to 8.0 (or 9.0
if that works out). Or maybe we'll suddenly start getting it on the
other machines after some unknowable change happens.