David Kramer comments:
| I have this problem with my Red Hat 7.3 server, where every now and then it
| gets into this mode where the hard drive is constantly active, the keyboard
| and mouse can take a minute or more to respond to each key press/mouse jerk.
| Sometimes it rights itself after an hour or so, sometimes I have to reboot.
| When in this state, networking works fine: I can ping the box and it will
| masquerade to the internet as it is supposed to, but sshing to it hangs.
|
| I have, on occasion, been able to catch a userland process taking all of the
| memory and cycles when the computer is in this state. On occasion it has
| been the sync software for my PDA.

Yeah, I've found that kpilot can start eating 99.9% of the cpu after it has finished a sync. I just kill it and the problem is fixed.

But more generally, I've seen the same sort of behavior on some test machines for my current project. We have Red Hat 7.2, 7.3, 8.0, and 9.0 machines. This is because the client has all these kinds of machines, and we need to test our software on all of them. The 7.* machines often show the behavior you describe, but according to ps and top, there are no processes using more than 1% of the cpu. What it seems to correlate with is seeing kupdated and/or kjournald at the top of top's cpu users. We've been trying to discover more info about this, but haven't learned much.

One of our theories was that it had to do with the journaling file system. So I dug into that, and found that by "downgrading" the file systems from ext3 to ext2, the journaling should be disabled. Maybe it is, but kjournald is still started at boot time (although there are no instances of "kjournald" in any startup script that we can find), and it still becomes active at times. When this happens, all other processes' cpu usage drops to near zero, and everything that does disk I/O seems to stop responding to input. Sometimes this lasts for a minute or more.
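Since kjournald is the kernel thread that services ext3 journals, one thing that could be checked is which filesystems the kernel actually has mounted as ext3, as opposed to what the configuration files ask for. The sketch below is only an illustration: the function name `journaled_mounts` is made up for this example, and it assumes the usual whitespace-separated `/proc/mounts` layout (device, mount point, filesystem type, options, ...).

```python
def journaled_mounts(mounts_text):
    """Return (device, mount_point) for each entry whose filesystem type is ext3.

    mounts_text is the text of /proc/mounts (hypothetical helper, not part
    of any library). Lines with fewer than three fields are ignored.
    """
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        # Field layout assumed: device, mount point, fs type, options, ...
        if len(fields) >= 3 and fields[2] == "ext3":
            found.append((fields[0], fields[1]))
    return found
```

On a live machine this would be called as `journaled_mounts(open("/proc/mounts").read())`. An empty result would suggest that the "downgrade" took effect for every mounted filesystem; a non-empty one would point at the filesystems still being journaled.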
This is especially unnerving considering that the machines are to be used as web servers. Total response failure for a minute at a time is a bit of a showstopper for a web server.

We also suspect that the problem may be the RAID software, but we haven't found any way to test this. Part of our reason for suspecting this is that "vmstat 1" shows the system flipping between two states. In the responsive state, bi and bo (blocks in and blocks out) are both mostly 3 digits. In the unresponsive state, bi drops to zero for the duration, while bo switches between 0 and 5 digits. It looks like the kernel has decided to flush buffers, and has halted all other tasks while this is done. This doesn't seem like normal kernel behavior. We haven't found any hooks that can tell us why this happens.

We don't seem to get this misbehavior with the 8.0 machines. We may have to tell the client to stop using RH 7.* and move to 8.0 (or 9.0 if that works out). Or maybe we'll suddenly start getting it on the other machines after some unknowable change happens.
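To put numbers on how often and how long the stalls last, a saved "vmstat 1" log could be scanned for runs of samples where bi stays at zero. This is a rough sketch, not a diagnostic tool: the function names and the `min_len` threshold are invented for the example, and the bi and bo columns are located by name from the header line because their positions differ between vmstat versions.

```python
def parse_vmstat(lines):
    """Return a list of (bi, bo) pairs from vmstat output lines.

    The column positions are taken from the header line that names
    the fields, so no fixed layout is assumed.
    """
    samples = []
    bi_idx = bo_idx = None
    for line in lines:
        fields = line.split()
        if "bi" in fields and "bo" in fields:
            # Column-name header (vmstat repeats it periodically).
            bi_idx, bo_idx = fields.index("bi"), fields.index("bo")
            continue
        if bi_idx is None or len(fields) <= max(bi_idx, bo_idx):
            continue  # group header, blank line, or data before any header
        try:
            samples.append((int(fields[bi_idx]), int(fields[bo_idx])))
        except ValueError:
            continue  # not a numeric data row
    return samples


def find_stalls(samples, min_len=5):
    """Return (start_index, length) for each run of at least min_len
    consecutive samples in which bi is zero."""
    stalls = []
    run_start = None
    for i, (bi, _bo) in enumerate(samples):
        if bi == 0:
            if run_start is None:
                run_start = i
        else:
            if run_start is not None and i - run_start >= min_len:
                stalls.append((run_start, i - run_start))
            run_start = None
    # Close a run that is still open at the end of the log.
    if run_start is not None and len(samples) - run_start >= min_len:
        stalls.append((run_start, len(samples) - run_start))
    return stalls
```

With one sample per second, the run length is roughly the stall duration in seconds, so `find_stalls(parse_vmstat(open("vmstat.log")), min_len=30)` would list stalls of about half a minute or more (the file name here is just a placeholder).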