![]() | Ned Batchelder : Blog | Code | Text | Site A server crash hunted down » Home : Blog : September 2003 |
A server crash hunted downSaturday 27 September 2003 These days, we have great interpreted languages, and great IDEs for our compiled languages, and usually we can work at a pretty high level. But occasionally, the only way to solve a problem is to get down to the nitty gritty and do some old-fashioned detective work. Here's the story of a server crash hunted down to its lair the hard way. Every morning at eight, the Domino server would crash (actually the namgr.exe process would crash), and it seemed likely to be Kubi's fault (Kubi runs as an extension manager add-in, meaning our DLL is loaded into all the running processes in the server). But the notes.rip file (a postmortem stack dump similar to Dr. Watson) didn't show any useful information, just a single stack frame. We surmised that the agent manager was trying to shut down after having run some agents, and somehow it was crashing. But manually shutting down the agent manager didn't reproduce the problem. I figured it would crash again at eight this morning, so I set up WinDbg as the just-in-time debugger on the system. WinDbg is difficult to use (it has a bizarre MDI-as-toolbars approach to windowing), but is very powerful and very lightweight (it doesn't have to be installed: just run it). Sure enough, at eight this morning, the agent manager crashed again, and WinDbg came up showing the crash. Unfortunately, since we were running a release build, the optimized code produced the same bad stack trace. Only a single frame was presented, and it had a dubious method name associated with it. My choices were to install a debug build and wait another day, or dig harder now. I'd been looking at this crash for a day already, and didn't feel like waiting any more. What information did I have? basically just the instruction pointer, and the disassembled code that was executing when it crased. I looked at the disassembly window in WinDbg: 06a89170 51 push ecx Maybe I could trace that machine code to a particular place in the built code to at least find out which routine we were in. I used Perforce to get the sources from a few days ago (when the running software was built). I modified the Visual Studio projects to produce .cod files (object-level listings that combine source code, machine code and assembler), and rebuilt the release build. This produced a 74,000-line .cod file for the class that WinDbg indicated was at fault. I searched the file for "8d 4c 24 04" (the first longish value in the disassembly). There were 40 occurrences. Taking into account all of the machine code from the snippet above, there was only one place in the compiled code that matched completely: ; 418 : inline _bstr_t ICertificate::GetIssuerName ( ) { While release builds are harder to debug because the optimizer scrambles code, here the optimizer helped me. Unoptimized code has more common code snippets due to inlining, while the optimizer will tend to shuffle code around, making the code more unique throughout. The .cod file pinpointed the routine that had failed (it wasn't the one the debugger had named). Here's where I got lucky. The routine was only called from one place, and that routine was only called from one place, and that was during shutdown, so our theory had been validated. The problem became easy to solve at that point: we were shutting down facilities that had never been started up. There's something really satisfying about digging in and persevering to find a problem, even through "unreasonable" steps like rebuilding the whole product to be able to read the machine code. How great it is to have a bug lying dead on the floor, with a sharp knife thrust cleanly through its middle!
tagged:
debugging» 3 reactions | |
Comments
Cool. Crash detective work can be so rewarding when you have that "Aha!" moment hen you've found the cause.
Remember the "creepy threads" problem?
Followup question: so did you modify your build process to generate .cod files so you won't have to go through this the next time around?
--bob
We haven't changed the build process yet, because .cod files are really big and they seem to slow down the process quite a bit. It feels like the kind of thing we'd want to do once we knew the build was an "important" build, but that means going back and building it again, etc.
Here's a question, while I'm wearing my "think outside the box" hat:
Would the crash happen two minutes after you set the clock to 7:58am? :-)
Best,
-=Alan
Add a comment: