

Dmitry Vostokov, 25th September 2009 (revised 10th November 2009)
http://www.dumpanalysis.org
Only memory dump analysis (static troubleshooting and postmortem debugging) or only software trace analysis (dynamic troubleshooting and postmortem debugging) is not always sufficient to resolve customer problems. Sometimes the combination of both methods provides the best result as the following synthetic case study shows.
The statistical application was consuming 50% of CPU time on a 2 processor machine (Spiking Thread pattern[1]). The forced user memory dump revealed normal stack traces (Stack Trace Collection pattern[2], only main thread stack trace is shown here):
0:000> ~*kL . 0 Id: 17c0.17c4 Suspend: 1 Teb: 7ffdf000 Unfrozen ChildEBP RetAddr WARNING: Stack unwind information not available. Following frames may be wrong. 0012faf8 004788ac ntdll!KiFastSystemCallRet 0012fb70 00905a2c Application+0x788ac 0012fb90 00905d94 Application+0x505a2c 0012fb98 00458bb2 Application+0x505d94 0012fbbc 00458f9d Application+0x58bb2 0012fd24 004587e7 Application+0x58f9d 0012fd40 0045b8d9 Application+0x587e7 0012fdb4 0042a4a6 Application+0x5b8d9 0012fdcc 7739b6e3 Application+0x2a4a6 0012fdf8 7739b874 user32!InternalCallWinProc+0x28 0012fe70 7739ba92 user32!UserCallWinProcCheckWow+0x151 0012fed8 773a16e5 user32!DispatchMessageWorker+0x327 0012fee8 00478868 user32!DispatchMessageA+0xf 0012ff54 009cf0f6 Application+0x78868 0012ffc0 77e6f23b Application+0x5cf0f6 0012fff0 00000000 kernel32!BaseProcessStart+0x23 […]
Unfortunately thread times statistics didn’t show which thread was spiking:
0:000> !runaway f User Mode Time Thread Time 0:17c4 0 days 0:00:06.968 1:17d8 0 days 0:00:00.031 2:17e8 0 days 0:00:00.015 6:1090 0 days 0:00:00.000 5:dc0 0 days 0:00:00.000 4:588 0 days 0:00:00.000 3:17fc 0 days 0:00:00.000 Kernel Mode Time Thread Time 0:17c4 0 days 0:00:02.562 2:17e8 0 days 0:00:00.031 6:1090 0 days 0:00:00.000 5:dc0 0 days 0:00:00.000 4:588 0 days 0:00:00.000 3:17fc 0 days 0:00:00.000 1:17d8 0 days 0:00:00.000 Elapsed Time Thread Time 0:17c4 0 days 0:14:16.454 1:17d8 0 days 0:14:16.266 2:17e8 0 days 0:14:15.704 3:17fc 0 days 0:14:11.204 4:588 0 days 0:14:09.876 5:dc0 0 days 0:13:59.297 6:1090 0 days 0:11:11.202
We requested an ETW trace (Event Trace for Windows) for all major components to see the dynamics of the system. The trace had a sudden gap of 5 minutes without any tracing activity (Discontinuity pattern[3]). Before that there was a significant software hook activity for 40 seconds visible from the start of the tracing so it could have been happening earlier and then after the silent gap there was another hook activity till the end of the tracing (about 10 seconds). We didn't know whether that sudden gap was during the time when the application did its calculations or it was done before or after that. This is an example of the repeated activity (Periodic Error[4] and Characteristic Message Block patterns[5]):
# Module PID TID Time Message
[…]
6870 HookA 2592 5476 08:24:34.437 Attempt to open CLSID\{ … }
in User hive
6871 HookA 2592 5476 08:24:34.437 User hive open failed: 2
6872 HookA 2592 5476 08:24:34.437 Attempt to open of Software\Classes\CLSID\{ … }
in System hive
6873 HookA 2592 5476 08:24:34.437 Hooked CoCreateInstanceEx: Calling original
CoCreateInstanceEx with original CLSID
[…]
Unfortunately not all possible components were selected for tracing, some common ones were obviously missing and the new full trace together with a simultaneous full user dump forced within the first minutes of CPU spiking were requested.
The new memory dump revealed a new thread that was missing in the first dump (Missing Thread pattern[6]):
7 Id: 17c0.1238 Suspend: 1 Teb: 7ffd5000 Unfrozen *** ERROR: Symbol file could not be found. Defaulted to export symbols for HookB.dll - ChildEBP RetAddr WARNING: Stack unwind information not available. Following frames may be wrong. 0461f738 658d1e58 ntdll!KiFastSystemCallRet 0461fe20 658d22c9 HookB+0x1e58 0461fe4c 0040cadd HookB+0x22c9 0461fe84 005c9467 Application+0xcadd 0461ff04 005ff7f5 Application+0x1c9467 0461ff44 0066a5f1 Application+0x1ff7f5 0461ff74 00428f7b Application+0x26a5f1 0461ffa4 00404f02 Application+0x28f7b 0461ffb8 77e6482f Application+0x4f02 0461ffec 00000000 kernel32!BaseThreadStart+0x34
Thread times statistics revealed that it was the thread with the most spent time in user and kernel modes:
0:000> !runaway f User Mode Time Thread Time 7:1238 0 days 0:00:07.687 0:17c4 0 days 0:00:06.062 1:17d8 0 days 0:00:00.031 6:1090 0 days 0:00:00.000 5:dc0 0 days 0:00:00.000 4:588 0 days 0:00:00.000 3:17fc 0 days 0:00:00.000 2:17e8 0 days 0:00:00.000 Kernel Mode Time Thread Time 7:1238 0 days 0:01:03.437 0:17c4 0 days 0:00:01.609 2:17e8 0 days 0:00:00.015 6:1090 0 days 0:00:00.000 5:dc0 0 days 0:00:00.000 4:588 0 days 0:00:00.000 3:17fc 0 days 0:00:00.000 1:17d8 0 days 0:00:00.000 Elapsed Time Thread Time 0:17c4 0 days 0:05:40.454 1:17d8 0 days 0:05:40.266 2:17e8 0 days 0:05:39.704 3:17fc 0 days 0:05:35.204 4:588 0 days 0:05:33.876 5:dc0 0 days 0:05:23.297 6:1090 0 days 0:02:35.202 7:1238 0 days 0:01:13.936
Using symbol files from the vendor we got the more refined stack trace:
7 Id: 17c0.1238 Suspend: 1 Teb: 7ffd5000 Unfrozen ChildEBP RetAddr Args to Child 0461f6f4 7c827369 7c839cfe 0461f778 02000000 ntdll!KiFastSystemCallRet 0461f6f8 7c839cfe 0461f778 02000000 0461f714 ntdll!ZwOpenKey+0xc 0461f738 658d1e58 02000000 0461f778 00000000 ntdll!RtlOpenCurrentUser+0x44 0461fe20 658d22c9 658d5018 658d5314 00000000 HookB!GetTimeZone+0xf8 0461fe4c 0040cadd 0461fe60 005fb5be 040fc3e4 HookB!GetLocalTime+0x89 WARNING: Stack unwind information not available. Following frames may be wrong. 0461fe84 005c9467 0461fe98 005c9661 0461ff04 Application+0xcadd 0461ff04 005ff7f5 0461ff4c 005ff810 0461ff44 Application+0x1c9467 0461ff44 0066a5f1 0461ff58 0066a5fb 0461ff74 Application+0x1ff7f5 0461ff74 00428f7b 0461ff88 00428f85 0461ffa4 Application+0x26a5f1 0461ffa4 00404f02 0461ffdc 00404a80 0461ffb8 Application+0x28f7b 0461ffb8 77e6482f 03a19418 00000000 00000000 Application+0x4f02 0461ffec 00000000 00404ed8 03a19418 00000000 kernel32!BaseThreadStart+0x34
The ETW trace confirmed HookB module involvement. We had massive HookB trace statement current of almost 30,000 msg/s and almost 99% trace statement density (Statement Density and Current pattern[7]):
# Module PID TID Time Message
[…]
555398 HookB 4276 744 13:09:15.319 Time zone registry key Software\Application\TimeZone\2
555399 HookB 4276 744 13:09:15.320 Use time zone:
Western Europe Standard Time
555400 HookB 4276 744 13:09:15.320 GetLocalTime (Application.exe) returns
14 hours 9 minutes 15 seconds
555401 HookB 4276 744 13:09:15.320 GetLocalTime (Application.exe) calling
GetTimeZone
[…]
Because this case study was synthesized from many real world examples it is not possible to provide its memory dump files and software trace logs. However, in order to help readers to play we have created a toy model problem. It is called MissingSpike application and it was created and compiled in Visual C++ 2008 Express Edition. The source code is very simple:
#include <windows.h>
#include <process.h>
void spiking_thread(void *param)
{
DWORD dwStart = GetTickCount();
while (GetTickCount() - dwStart < 1000*60)
{
OutputDebugString(L"Hello Computation!");
}
}
int _tmain(int argc, _TCHAR* argv[])
{
_beginthread(spiking_thread, 0, NULL);
while (true)
{
Sleep(100);
OutputDebugString(L"Hello Debugger!");
}
return 0;
}
The main thread creates the second one and the latter exits after 1 minute of CPU consumption. Both threads do software tracing. The second thread does it much more often to simulate high trace current and density.
If we run the application it starts consuming 50% CPU on a 2 processor machine. If we force a user dump after a few minutes we get this stack trace:
0:000:x86> ~*kL . 0 Id: 14d8.16a4 Suspend: 0 Teb: 7efdb000 Unfrozen ChildEBP RetAddr 003ffc0c 76c61380 ntdll_77170000!NtDelayExecution+0x15 003ffc74 76c60c88 kernel32!SleepEx+0x62 003ffc84 00a51064 kernel32!Sleep+0xf 003ffc94 00a511d6 MissingSpike!wmain+0x24 003ffcd8 76cdeccb MissingSpike!__tmainCRTStartup+0x10f 003ffce4 771ed24d kernel32!BaseThreadInitThunk+0xe 003ffd24 771ed45f ntdll_77170000!__RtlUserThreadStart+0x23 003ffd3c 00000000 ntdll_77170000!_RtlUserThreadStart+0x1b
If we didn’t have access to source code it would have been a complete surprise for us to see the absence of spiking threads:
0:000:x86> !runaway f User Mode Time Thread Time 0:16a4 0 days 0:00:00.000 Kernel Mode Time Thread Time 0:16a4 0 days 0:00:00.031 Elapsed Time Thread Time 0:16a4 0 days 0:10:49.639
We also ran DebugView before dumping the process and recorded the following debugging output:
00000188 125.66216278 [5336] Hello Debugger! 00000189 125.76219177 [5336] Hello Debugger! 00000190 125.86218262 [5336] Hello Debugger! 00000191 125.96219635 [5336] Hello Debugger! 00000192 126.06215668 [5336] Hello Debugger! 00000193 126.16222382 [5336] Hello Debugger! 00000194 126.26220703 [5336] Hello Debugger! 00000195 126.36220551 [5336] Hello Debugger! 00000196 126.46221161 [5336] Hello Debugger! 00000197 126.56221008 [5336] Hello Debugger! 00000198 126.66223907 [5336] Hello Debugger! 00000199 126.76223755 [5336] Hello Debugger! 00000200 126.86223602 [5336] Hello Debugger! 00000201 126.96223450 [5336] Hello Debugger! 00000202 127.06224823 [5336] Hello Debugger! 00000203 127.16225433 [5336] Hello Debugger! 00000204 127.26225281 [5336] Hello Debugger! 00000205 127.36226654 [5336] Hello Debugger! 00000206 127.46225739 [5336] Hello Debugger! 00000207 127.56227875 [5336] Hello Debugger! 00000208 127.66227722 [5336] Hello Debugger! 00000209 127.76228333 [5336] Hello Debugger! 00000210 127.86228943 [5336] Hello Debugger! 00000211 127.96231079 [5336] Hello Debugger! 00000212 128.06227112 [5336] Hello Debugger! 00000213 128.16230774 [5336] Hello Debugger! 00000214 128.26232910 [5336] Hello Debugger! 00000215 128.36231995 [5336] Hello Debugger! 00000216 128.46232605 [5336] Hello Debugger! 00000217 128.56236267 [5336] Hello Debugger! 00000218 128.66233826 [5336] Hello Debugger! 00000219 128.76234436 [5336] Hello Debugger! 00000220 128.86239624 [5336] Hello Debugger! 00000221 128.96234131 [5336] Hello Debugger! 00000222 129.06233215 [5336] Hello Debugger! 00000223 129.16236877 [5336] Hello Debugger! 00000224 129.26237488 [5336] Hello Debugger! 00000225 129.36238098 [5336] Hello Debugger! 00000226 129.46238708 [5336] Hello Debugger! 00000227 129.56239319 [5336] Hello Debugger! 00000228 129.66239929 [5336] Hello Debugger! 00000229 129.76242065 [5336] Hello Debugger!
Next time we tried to save a user dump as soon as Task Manager showed high CPU consumption:

The dump revealed the new thread:
0:000:x86> ~*kL . 0 Id: d5c.1708 Suspend: 0 Teb: 7efdb000 Unfrozen ChildEBP RetAddr 0035f4ac 76c61270 ntdll_77170000!ZwWaitForSingleObject+0x15 0035f51c 76c611d8 kernel32!WaitForSingleObjectEx+0xbe 0035f530 76c81961 kernel32!WaitForSingleObject+0x12 0035f794 76cc9060 kernel32!OutputDebugStringA+0xef 0035f7b4 00a5106b kernel32!OutputDebugStringW+0x41 0035f7c4 00a511d6 MissingSpike!wmain+0x2b 0035f808 76cdeccb MissingSpike!__tmainCRTStartup+0x10f 0035f814 771ed24d kernel32!BaseThreadInitThunk+0xe 0035f854 771ed45f ntdll_77170000!__RtlUserThreadStart+0x23 0035f86c 00000000 ntdll_77170000!_RtlUserThreadStart+0x1b 1 Id: d5c.d58 Suspend: 0 Teb: 7efd8000 Unfrozen ChildEBP RetAddr 00c7f6c0 76c61270 ntdll_77170000!ZwWaitForSingleObject+0x15 00c7f730 76c611d8 kernel32!WaitForSingleObjectEx+0xbe 00c7f744 76c83204 kernel32!WaitForSingleObject+0x12 00c7f9a8 76cc9060 kernel32!OutputDebugStringA+0x1ba 00c7f9c8 00a51027 kernel32!OutputDebugStringW+0x41 00c7f9dc 6ed932a3 MissingSpike!spiking_thread+0x27 00c7fa14 6ed9332b msvcr90!_callthreadstart+0x1b 00c7fa1c 76cdeccb msvcr90!_threadstart+0x5d 00c7fa28 771ed24d kernel32!BaseThreadInitThunk+0xe 00c7fa68 771ed45f ntdll_77170000!__RtlUserThreadStart+0x23 00c7fa80 00000000 ntdll_77170000!_RtlUserThreadStart+0x1b
Now we see that was a CPU consumer:
0:000:x86> !runaway f User Mode Time Thread Time 1:d58 0 days 0:00:04.102 0:1708 0 days 0:00:00.000 Kernel Mode Time Thread Time 1:d58 0 days 0:00:06.598 0:1708 0 days 0:00:00.031 Elapsed Time Thread Time 0:1708 0 days 0:00:49.505 1:d58 0 days 0:00:49.494
Less than a minute passed. After one minute the process becomes quiet from CPU consumption perspective:

Recorded DebugView trace has 246,100 lines in just one minute. Here is the fragment:
00245186 59.96057129 [3420] Hello Computation! 00245187 59.96061707 [3420] Hello Computation! 00245188 59.96066284 [3420] Hello Computation! 00245189 59.96070862 [3420] Hello Computation! 00245190 59.96075058 [3420] Hello Computation! 00245191 59.96082306 [3420] Hello Computation! 00245192 59.96085739 [3420] Hello Debugger! 00245193 59.96089554 [3420] Hello Computation! 00245194 59.96093750 [3420] Hello Computation! 00245195 59.96098328 [3420] Hello Computation! 00245196 59.96102905 [3420] Hello Computation! 00245197 59.96106720 [3420] Hello Computation! 00245198 59.96111298 [3420] Hello Computation! 00245199 59.96115494 [3420] Hello Computation! 00245200 59.96119690 [3420] Hello Computation! 00245201 59.96124268 [3420] Hello Computation!
The MissingSpike application and its symbols can be downloaded from:
http://www.dumpanalysis.org/Debugged/Sep09/download/MissingSpike.zip
[1] http://www.dumpanalysis.org/blog/index.php/2007/05/11/crash-dump-analysi...
[2] http://www.dumpanalysis.org/blog/index.php/2007/09/14/crash-dump-analysi...
[3] http://www.dumpanalysis.org/blog/index.php/2009/08/04/trace-analysis-pat...
[4] http://www.dumpanalysis.org/blog/index.php/2009/04/28/trace-analysis-pat...
[5] http://www.dumpanalysis.org/blog/index.php/2009/11/07/trace-analysis-pat...
[6] http://www.dumpanalysis.org/blog/index.php/2007/08/12/crash-dump-analysi...
[7] http://www.dumpanalysis.org/blog/index.php/2009/07/22/trace-analysis-pat...