Is there a daemon to kill growing processes?

Wed May 28 15:42:39 EDT 2003

David Kramer comments:
| I have this problem with my Red Hat 7.3 server, where every now and then it
| gets into this mode where the hard drive is constantly active, the keyboard
| and mouse can take a minute or more to respond to each key press/mouse jerk.
| Sometimes it rights itself after an hour or so, sometimes I have to reboot.
| When in this state, networking works fine: I can ping the box and it will
| masquerade to the internet as it is supposed to, but sshing to it hangs.
|
| I have, on occasion, been able to catch a userland process taking all of the
| memory and cycles when the computer is in this state.  On occasion it has
| been the sync software for my PDA.

Yeah, I've found that kpilot can start eating 99.9% of the cpu  after
it has finished a sync.  I just kill it and the problem is fixed.
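
For the original question, a rough sketch of the "find the hog" half of
such a daemon might look like the following. It is only an illustration:
the function names and the 90% threshold are made up here, and it only
reports candidates rather than killing anything. Note that on Linux, ps
reports %CPU averaged over the life of the process, so a process that
has only just started spinning may not cross the threshold right away.

```python
import subprocess


def cpu_hogs(ps_text, threshold=90.0):
    """Parse `ps -eo pid,pcpu,comm` output; return (pid, pcpu, command)
    for every process at or above `threshold` percent CPU.
    The threshold is an arbitrary value chosen for illustration."""
    hogs = []
    for line in ps_text.splitlines()[1:]:  # skip the header row
        fields = line.split(None, 2)
        if len(fields) < 3:
            continue
        pid, pcpu, comm = fields
        try:
            if float(pcpu) >= threshold:
                hogs.append((int(pid), float(pcpu), comm.strip()))
        except ValueError:
            continue  # ignore rows that do not parse as numbers
    return hogs


def current_hogs(threshold=90.0):
    """Run ps once and return the processes over the threshold."""
    result = subprocess.run(
        ["ps", "-eo", "pid,pcpu,comm"],
        capture_output=True, text=True, check=True,
    )
    return cpu_hogs(result.stdout, threshold)
```

Calling current_hogs() from a loop with a sleep, and deciding what to do
with each PID it returns, is left out on purpose; automatically killing
processes is risky enough that it deserves a whitelist at the very least.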

But more generally, I've seen the same sort of behavior on some  test
machines  for  my current project.  We have RedHat 7.2, 7.3, 8.0, and
9.0 machines.  This is because the client  has  all  these  kinds  of
machines,  and  we need to test our software on all of them.  The 7.*
machines often show the behavior you describe, but  according  to  ps
and top, there are no processes using more than 1% of the cpu.

What it seems to correlate with is seeing kupdated  and/or  kjournald
at  the  top  of top's cpu users.  We've been trying to discover more
info about this, but haven't learned much.

One of our theories was that it had to do with  the  journaling  file
system.  So I dug into that, and found that by "downgrading" the file
systems from ext3 to ext2, the journaling should be disabled.   Maybe
it  is,  but  kjournald is still started at boot time (although there
are no instances of "kjournald" in any startup  script  that  we  can
find),  and it still becomes active at times.  When this happens, all
other processes' cpu usage drops to near zero,  and  everything  that
does disk I/O seems to stop responding to input. Sometimes this lasts
for a minute or more.
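
One way to check whether the downgrade really took effect is to look at
what the kernel itself reports in /proc/mounts, rather than at what
/etc/fstab asks for. The sketch below is only an illustration (the
function name is made up); it lists every filesystem the kernel says is
mounted as ext3. If anything shows up, journaling is still in use there.

```python
def journaled_mounts(mounts_text):
    """Return (device, mount_point) for each ext3 entry in text that is
    formatted like /proc/mounts."""
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        # Fields are: device, mount point, fs type, options, dump, pass
        if len(fields) >= 3 and fields[2] == "ext3":
            found.append((fields[0], fields[1]))
    return found


def read_proc_mounts(path="/proc/mounts"):
    """Read the kernel's view of the mounted filesystems."""
    with open(path) as handle:
        return handle.read()
```

Something like journaled_mounts(read_proc_mounts()) would then give the
list for the running machine.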

This is especially unnerving considering that the machines are to  be
used as web servers. Total response failure for a minute at a time is
a bit of a showstopper for a web server.

We also suspect that the problem may be the  RAID  software,  but  we
haven't found any way to test this. Part of our reason for suspecting
this is that "vmstat 1" shows the system flipping between two states.
In  the  responsive state, the bi and bo (blocks in and out) are both
mostly 3 digits.  In the unresponsive state, bi drops to zero for the
duration, while bo switches between 0 and 5 digits. It looks like the
kernel has decided to flush buffers, and has halted all  other  tasks
while this is done. This doesn't seem like normal kernel behavior. We
haven't found any hooks that can tell us why this happens.
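
To put timestamps on these episodes, the "vmstat 1" output could be fed
through something like the sketch below. It is an illustration only: the
function name and the 10000-block cutoff (standing in for "5 digits")
are assumptions, and it finds the bi and bo columns from the header line
because the column layout differs between vmstat versions.

```python
def flush_bursts(vmstat_text, bo_burst=10000):
    """Return (bi, bo, is_burst) for each sample line of vmstat output.
    A sample counts as a burst when bi is zero and bo is at least
    `bo_burst`, an arbitrary cutoff chosen for illustration."""
    cols = None
    samples = []
    for line in vmstat_text.splitlines():
        fields = line.split()
        if "bi" in fields and "bo" in fields:
            # Header row: remember where the two columns live
            cols = (fields.index("bi"), fields.index("bo"))
            continue
        if cols is None or not fields:
            continue
        if not all(f.isdigit() for f in fields):
            continue  # banner line or anything else non-numeric
        if len(fields) <= max(cols):
            continue
        bi = int(fields[cols[0]])
        bo = int(fields[cols[1]])
        samples.append((bi, bo, bi == 0 and bo >= bo_burst))
    return samples
```

The samples with bo at zero during a stall are not flagged, since an
idle machine looks the same; only the large write bursts are marked.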

We don't seem to get this misbehavior with the 8.0 machines.  We  may
have  to tell the client to stop using RH 7.* and move to 8.0 (or 9.0
if that works out).  Or maybe we'll suddenly start getting it on  the
other machines after some unknowable change happens.