How do you troubleshoot completely random problems?
My home desktop machine has been suffering from a Linux kernel "Oops" approximately once every two days for the last few weeks. I would really like it to stop doing that. When I get a stack trace in my logs, it's consistently in the "kswapd" process, even though I disabled all swap weeks ago.
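One way to check whether the traces really all point at the same process is to tally them from the logs. The sketch below is only an illustration: it assumes each trace contains a `comm: <name>` field (many kernel versions print one, but the exact format varies), and the function name `tally_oops_processes` is made up here.

```python
import re
from collections import Counter

# Assumption: the trace names the process in a "comm: <name>" field.
# Case is ignored because kernels differ in how they capitalize it.
COMM_RE = re.compile(r"\bcomm:\s*(\S+)", re.IGNORECASE)

def tally_oops_processes(lines):
    """Count how often each process name appears in 'comm:' fields."""
    counts = Counter()
    for line in lines:
        match = COMM_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Feeding it the lines of a saved kernel log, for example `tally_oops_processes(open(path))` with `path` pointing at wherever your distro keeps that log, gives a count per process name. It counts every line with a `comm:` field, so it is best run on a log that has already been cut down to the traces.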
I'm running Edgy on this machine, just like I was running it on my laptop and am running it on my work desktop. Those machines were both completely stable (modulo occasional ndiswrapper issues) running the exact same kernel.
It doesn't seem like it's a hardware issue. At least, the same machine has never exhibited any problems under Windows.
It isn't deterministically reproducible. It always seems to be in response to a click or some kind of user-input event during heavy disk I/O, but flogging the disks and mashing the keyboard, even for hours at a time, doesn't cause it to happen.
I am considering a fresh re-install to attempt a fix for this, but besides the inelegance of that solution, it seems likely that it will leave me in the same place.
Does anyone have a suggestion for tracking this down so that I'll actually know that it's fixed?
9 comments:
I had similar problems in the past. It turned out to be a fault in a memory stick. The only way to find it was to run an exhaustive memtest. The BIOS's built-in memory test didn't catch it!
If it's not hardware related, you can try running a distro from CD, either Knoppix or something similar, and hope for the best/worst.
Regards,
DQ
It's definitely a hardware-related problem, and I too bet on memory.
Leave memtest86 (booting from the Edgy CD) running all night, and see what happens.
If it says nothing, try swapping the RAM just the same.
The reason it doesn't happen under Windows is probably that Windows and Linux use different memory allocation strategies.
Thanks for the suggestion, but, I should have added this in the original post:
The machine has already survived memtest86 running for several hours, as well as burnMMX running in parallel several times.
Given the near-unanimous reaction of "hardware problem" though, maybe it's the video card, given that the crash almost always happens during some graphical interactive operation (it generally happens when maximizing or minimizing a window, something that moves a lot of pixels, and dies halfway through, leaving the screen in a visibly inconsistent state, with a window half-redrawn).
It seems like that's the most likely thing. The only problem with that theory is that this machine survives hours of 3D applications (which are much more heat-intensive for the card) without trouble in Windows, and starting one up for a burn-in test in Linux doesn't cause problems either.
(Thanks for the suggestion though.)
I'm curious if you ever got this machine stabilized.
I've just gone through a similar situation with a machine at work. The machine is an amd64 box with ECC memory running Gentoo. At first, the machine was unstable during normal use. It would sporadically lock up. It had previously spent a week with the admins getting all the software installed, so I initially thought it was a software problem. A few kernel tweaks and the problems lessened, but still existed. Next came the memtests and when memory was found to be bad, it was replaced.
A few weeks down the line, a user of the system managed to have it lock up when he put a high load on the machine running some of his experimental code. I thought it was strange for the whole thing to lock up especially since it was fine for my normal work (mostly compilation and short debugging sessions). The first guess was more memory errors since if one stick was bad others were likely bad as well.
Again the system seemed stable, but down the road it started crashing once more, and this time the crashes were very elusive and harder to reproduce. I finally noticed machine check errors in the system logs. They reported a lot of ECC corrections, way too many to be within normal operation. At this point, I found it hard to believe that more memory was bad. Finally, after talking to the manufacturer, a new mobo w/ CPUs was sent out.
I'm in the process of stressing this hardware right now.
I think the ECC memory kind of complicated the situation since single bit errors that are not close together will be corrected with nothing more than a message in the log. Stress apparently caused the frequency of the errors to increase.
It really took forever to try and work out what was actually wrong with the system. When memtest fails, you just assume a ram stick is bad, since that is the most likely problem. However, it really could be in any number of components, including the CPU itself.
One thing that I did find through all of this is that mprime can be a pretty good stress test for the system. It has a mode where it will calculate known primes, and if any are not correct then you know there is a hardware problem. My system failed this test before the mobo replacement.
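The idea behind that kind of test can be sketched in a few lines: repeat a calculation whose answer is already known, and treat any mismatch as a sign of faulty hardware. The sketch below is not mprime's code and is far too light to stress a machine. It uses the Lucas-Lehmer test on a handful of small Mersenne exponents, and the names `lucas_lehmer`, `KNOWN_RESULTS` and `selfcheck` are made up for illustration.

```python
def lucas_lehmer(p):
    """Return True if 2**p - 1 is prime (valid for odd prime p)."""
    m = (1 << p) - 1
    s = 4
    for _ in range(p - 2):
        s = (s * s - 2) % m
    return s == 0

# Known answers for a few small exponents: True means 2**p - 1 is prime.
KNOWN_RESULTS = {
    3: True, 5: True, 7: True, 11: False, 13: True, 17: True,
    19: True, 23: False, 29: False, 31: True, 61: True, 89: True,
    107: True, 127: True, 521: True, 607: True, 1279: True,
}

def selfcheck(rounds=1):
    """Recompute every known result; return (round, exponent) mismatches."""
    failures = []
    for r in range(rounds):
        for p, expected in KNOWN_RESULTS.items():
            if lucas_lehmer(p) != expected:
                failures.append((r, p))
    return failures
```

On healthy hardware `selfcheck()` should return an empty list no matter how many rounds you run. A real stress test works on far larger numbers and keeps every core busy, which is why mprime is the better tool for the job.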
Hopefully some of this information helps, or, if you have already solved your problem, comes in handy down the road!
Oops, I really meant to post my overly long comment as a reply to the journal post and not to one of the comments.
What log do ECC errors appear in? I didn't see anything like that.
This machine is still effectively dead. It shows many symptoms of having bad RAM, but aggressive memory testing with several different tools yields nothing.
Well, on AMD64 the kernel does not log the ECC errors by default. Instead it writes them to a buffer, which you can check using the mcelog command. My system has a cron script that runs mcelog periodically and creates an entry in the /var/log/mcelog file. I think this is the default behavior in Gentoo, but I'm not sure about other distros.
I think this behavior might be unique for x86_64 systems. On others the errors might just be reported in the standard syslog-ng interface, but I'm not completely sure about this.
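For anyone whose distro doesn't ship such a cron script, the job is roughly: run the command, and if it printed anything, append that to a log with a timestamp. This is a rough sketch and not the Gentoo script. The helper name `append_command_output` and the log layout are my own invention, and the exact mcelog invocation is an assumption you should check against your version's man page.

```python
import subprocess
from datetime import datetime
from pathlib import Path

def append_command_output(command, log_path):
    """Run command; append any output to log_path under a timestamp.

    Returns the number of output lines that were logged.
    """
    result = subprocess.run(command, capture_output=True, text=True, check=False)
    output = result.stdout.strip()
    if not output:
        return 0  # nothing reported, so leave the log untouched
    stamp = datetime.now().isoformat(timespec="seconds")
    with Path(log_path).open("a") as log:
        log.write(f"--- {stamp} ---\n{output}\n")
    return len(output.splitlines())
```

Run from cron as root, a call like `append_command_output(["mcelog"], "/var/log/mcelog")` would then leave a dated entry whenever the buffer had something in it.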
Have you tried mprime (I am referring to this version http://www.mersenne.org/freesoft.htm)? Supposedly this is a good stress test for the system and may reveal problems that something like memtest won't.