Memory Dump and Trace Analysis: A Unified Pattern Approach

Dmitry Vostokov, 25th September 2009 (revised 10th November 2009)
http://www.dumpanalysis.org

Only memory dump analysis (static trouble­shooting and postmortem debugging) or only software trace analysis (dynamic troubleshoot­ing and postmortem debugging) is not always sufficient to resolve customer problems. Sometimes the combination of both methods provides the best result as the following synthetic case study shows.

The statistical application was consuming 50% of CPU time on a 2 processor machine (Spiking Thread pat­tern[1]). The forced user memory dump revealed normal stack traces (Stack Trace Collection pattern[2], only main thread stack trace is shown here):

0:000> ~*kL

.  0  Id: 17c0.17c4 Suspend: 1 Teb: 7ffdf000 Unfrozen
ChildEBP RetAddr  
WARNING: Stack unwind information not available. Following frames may be wrong.
0012faf8 004788ac ntdll!KiFastSystemCallRet
0012fb70 00905a2c Application+0x788ac
0012fb90 00905d94 Application+0x505a2c
0012fb98 00458bb2 Application+0x505d94
0012fbbc 00458f9d Application+0x58bb2
0012fd24 004587e7 Application+0x58f9d
0012fd40 0045b8d9 Application+0x587e7
0012fdb4 0042a4a6 Application+0x5b8d9
0012fdcc 7739b6e3 Application+0x2a4a6
0012fdf8 7739b874 user32!InternalCallWinProc+0x28
0012fe70 7739ba92 user32!UserCallWinProcCheckWow+0x151
0012fed8 773a16e5 user32!DispatchMessageWorker+0x327
0012fee8 00478868 user32!DispatchMessageA+0xf
0012ff54 009cf0f6 Application+0x78868
0012ffc0 77e6f23b Application+0x5cf0f6
0012fff0 00000000 kernel32!BaseProcessStart+0x23

[…]

Unfortunately thread times statistics didn’t show which thread was spiking:

0:000> !runaway f
 User Mode Time
  Thread       Time
   0:17c4      0 days 0:00:06.968
   1:17d8      0 days 0:00:00.031
   2:17e8      0 days 0:00:00.015
   6:1090      0 days 0:00:00.000
   5:dc0       0 days 0:00:00.000
   4:588       0 days 0:00:00.000
   3:17fc      0 days 0:00:00.000
 Kernel Mode Time
  Thread       Time
   0:17c4      0 days 0:00:02.562
   2:17e8      0 days 0:00:00.031
   6:1090      0 days 0:00:00.000
   5:dc0       0 days 0:00:00.000
   4:588       0 days 0:00:00.000
   3:17fc      0 days 0:00:00.000
   1:17d8      0 days 0:00:00.000
 Elapsed Time
  Thread       Time
   0:17c4      0 days 0:14:16.454
   1:17d8      0 days 0:14:16.266
   2:17e8      0 days 0:14:15.704
   3:17fc      0 days 0:14:11.204
   4:588       0 days 0:14:09.876
   5:dc0       0 days 0:13:59.297
   6:1090      0 days 0:11:11.202

We requested an ETW trace (Event Trace for Windows) for all major components to see the dynamics of the sys­tem. The trace had a sudden gap of 5 minutes without any tracing activity (Discontinuity pattern[3]). Before that there was a significant software hook activity for 40 seconds visible from the start of the tracing so it could have been happening earlier and then after the silent gap there was another hook activity till the end of the tracing (about 10 seconds). We didn't know whether that sudden gap was during the time when the application did its calculations or it was done before or after that. This is an example of the repeated activity (Periodic Error[4] and Characteristic Message Block patterns[5]):

#      Module  PID    TID   Time         Message 
[…]
6870   HookA  2592  5476  08:24:34.437  Attempt to open CLSID\{ … } 
                                        in User hive  
6871   HookA  2592  5476  08:24:34.437  User hive open failed: 2    
6872   HookA  2592  5476  08:24:34.437  Attempt to open of Software\Classes\CLSID\{ … } 
                                        in System hive 
6873   HookA  2592  5476  08:24:34.437  Hooked CoCreateInstanceEx: Calling original 
                                        CoCreateInstanceEx with original CLSID  
[…]

Unfortunately not all possible components were selected for tracing, some common ones were obviously missing and the new full trace together with a simultaneous full user dump forced within the first minutes of CPU spiking were requested.

The new memory dump revealed a new thread that was missing in the first dump (Missing Thread pattern[6]):

   7  Id: 17c0.1238 Suspend: 1 Teb: 7ffd5000 Unfrozen
*** ERROR: Symbol file could not be found.  Defaulted to export symbols for HookB.dll - 
ChildEBP RetAddr  
WARNING: Stack unwind information not available. Following frames may be wrong.
0461f738 658d1e58 ntdll!KiFastSystemCallRet
0461fe20 658d22c9 HookB+0x1e58
0461fe4c 0040cadd HookB+0x22c9
0461fe84 005c9467 Application+0xcadd
0461ff04 005ff7f5 Application+0x1c9467
0461ff44 0066a5f1 Application+0x1ff7f5
0461ff74 00428f7b Application+0x26a5f1
0461ffa4 00404f02 Application+0x28f7b
0461ffb8 77e6482f Application+0x4f02
0461ffec 00000000 kernel32!BaseThreadStart+0x34

Thread times statistics revealed that it was the thread with the most spent time in user and kernel modes:

0:000> !runaway f
 User Mode Time
  Thread       Time
   7:1238      0 days 0:00:07.687
   0:17c4      0 days 0:00:06.062
   1:17d8      0 days 0:00:00.031
   6:1090      0 days 0:00:00.000
   5:dc0       0 days 0:00:00.000
   4:588       0 days 0:00:00.000
   3:17fc      0 days 0:00:00.000
   2:17e8      0 days 0:00:00.000
 Kernel Mode Time
  Thread       Time
   7:1238      0 days 0:01:03.437
   0:17c4      0 days 0:00:01.609
   2:17e8      0 days 0:00:00.015
   6:1090      0 days 0:00:00.000
   5:dc0       0 days 0:00:00.000
   4:588       0 days 0:00:00.000
   3:17fc      0 days 0:00:00.000
   1:17d8      0 days 0:00:00.000
 Elapsed Time
  Thread       Time
   0:17c4      0 days 0:05:40.454
   1:17d8      0 days 0:05:40.266
   2:17e8      0 days 0:05:39.704
   3:17fc      0 days 0:05:35.204
   4:588       0 days 0:05:33.876
   5:dc0       0 days 0:05:23.297
   6:1090      0 days 0:02:35.202
   7:1238      0 days 0:01:13.936

Using symbol files from the vendor we got the more re­fined stack trace:

7  Id: 17c0.1238 Suspend: 1 Teb: 7ffd5000 Unfrozen
ChildEBP RetAddr  Args to Child              
0461f6f4 7c827369 7c839cfe 0461f778 02000000 ntdll!KiFastSystemCallRet
0461f6f8 7c839cfe 0461f778 02000000 0461f714 ntdll!ZwOpenKey+0xc
0461f738 658d1e58 02000000 0461f778 00000000 ntdll!RtlOpenCurrentUser+0x44
0461fe20 658d22c9 658d5018 658d5314 00000000 HookB!GetTimeZone+0xf8
0461fe4c 0040cadd 0461fe60 005fb5be 040fc3e4 HookB!GetLocalTime+0x89
WARNING: Stack unwind information not available. Following frames may be wrong.
0461fe84 005c9467 0461fe98 005c9661 0461ff04 Application+0xcadd
0461ff04 005ff7f5 0461ff4c 005ff810 0461ff44 Application+0x1c9467
0461ff44 0066a5f1 0461ff58 0066a5fb 0461ff74 Application+0x1ff7f5
0461ff74 00428f7b 0461ff88 00428f85 0461ffa4 Application+0x26a5f1
0461ffa4 00404f02 0461ffdc 00404a80 0461ffb8 Application+0x28f7b
0461ffb8 77e6482f 03a19418 00000000 00000000 Application+0x4f02
0461ffec 00000000 00404ed8 03a19418 00000000 kernel32!BaseThreadStart+0x34

The ETW trace confirmed HookB module involvement. We had massive HookB trace statement current of almost 30,000 msg/s and almost 99% trace statement density (Statement Density and Current pattern[7]):

#      Module  PID    TID   Time          Message 
[…]
555398 HookB  4276   744   13:09:15.319   Time zone registry key Software\Application\TimeZone\2
555399 HookB  4276   744   13:09:15.320   Use time zone: 
                                          Western Europe Standard Time     
555400 HookB  4276   744   13:09:15.320   GetLocalTime (Application.exe) returns 
                                          14 hours 9 minutes 15 seconds       
555401 HookB  4276   744   13:09:15.320   GetLocalTime (Application.exe) calling 
                                          GetTimeZone        
[…]

Because this case study was synthesized from many real world examples it is not possible to provide its memory dump files and software trace logs. However, in order to help readers to play we have created a toy model problem. It is called MissingSpike application and it was created and compiled in Visual C++ 2008 Express Edition. The source code is very simple:

#include <windows.h>
#include <process.h>

void spiking_thread(void *param)
{
   DWORD dwStart = GetTickCount();
   while (GetTickCount() - dwStart < 1000*60)
   {
      OutputDebugString(L"Hello Computation!");
   }
}

int _tmain(int argc, _TCHAR* argv[])
{
   _beginthread(spiking_thread, 0, NULL);

   while (true)
   {
      Sleep(100);
      OutputDebugString(L"Hello Debugger!");
   }

   return 0;
}

The main thread creates the second one and the latter ex­its after 1 minute of CPU consumption. Both threads do software tracing. The second thread does it much more often to simulate high trace current and density.

If we run the application it starts consuming 50% CPU on a 2 processor machine. If we force a user dump after a few minutes we get this stack trace:

0:000:x86> ~*kL

.  0  Id: 14d8.16a4 Suspend: 0 Teb: 7efdb000 Unfrozen
ChildEBP RetAddr  
003ffc0c 76c61380 ntdll_77170000!NtDelayExecution+0x15
003ffc74 76c60c88 kernel32!SleepEx+0x62
003ffc84 00a51064 kernel32!Sleep+0xf
003ffc94 00a511d6 MissingSpike!wmain+0x24
003ffcd8 76cdeccb MissingSpike!__tmainCRTStartup+0x10f
003ffce4 771ed24d kernel32!BaseThreadInitThunk+0xe
003ffd24 771ed45f ntdll_77170000!__RtlUserThreadStart+0x23
003ffd3c 00000000 ntdll_77170000!_RtlUserThreadStart+0x1b

If we didn’t have access to source code it would have been a complete surprise for us to see the absence of spiking threads:

0:000:x86> !runaway f
 User Mode Time
  Thread       Time
   0:16a4      0 days 0:00:00.000
 Kernel Mode Time
  Thread       Time
   0:16a4      0 days 0:00:00.031
 Elapsed Time
  Thread       Time
   0:16a4      0 days 0:10:49.639

We also ran DebugView before dumping the process and recorded the following debugging output:

00000188      125.66216278  [5336] Hello Debugger!
00000189      125.76219177  [5336] Hello Debugger!
00000190      125.86218262  [5336] Hello Debugger!
00000191      125.96219635  [5336] Hello Debugger!
00000192      126.06215668  [5336] Hello Debugger!
00000193      126.16222382  [5336] Hello Debugger!
00000194      126.26220703  [5336] Hello Debugger!
00000195      126.36220551  [5336] Hello Debugger!
00000196      126.46221161  [5336] Hello Debugger!
00000197      126.56221008  [5336] Hello Debugger!
00000198      126.66223907  [5336] Hello Debugger!
00000199      126.76223755  [5336] Hello Debugger!
00000200      126.86223602  [5336] Hello Debugger!
00000201      126.96223450  [5336] Hello Debugger!
00000202      127.06224823  [5336] Hello Debugger!
00000203      127.16225433  [5336] Hello Debugger!
00000204      127.26225281  [5336] Hello Debugger!
00000205      127.36226654  [5336] Hello Debugger!
00000206      127.46225739  [5336] Hello Debugger!
00000207      127.56227875  [5336] Hello Debugger!
00000208      127.66227722  [5336] Hello Debugger!
00000209      127.76228333  [5336] Hello Debugger!
00000210      127.86228943  [5336] Hello Debugger!
00000211      127.96231079  [5336] Hello Debugger!
00000212      128.06227112  [5336] Hello Debugger!
00000213      128.16230774  [5336] Hello Debugger!
00000214      128.26232910  [5336] Hello Debugger!
00000215      128.36231995  [5336] Hello Debugger!
00000216      128.46232605  [5336] Hello Debugger!
00000217      128.56236267  [5336] Hello Debugger!
00000218      128.66233826  [5336] Hello Debugger!
00000219      128.76234436  [5336] Hello Debugger!
00000220      128.86239624  [5336] Hello Debugger!
00000221      128.96234131  [5336] Hello Debugger!
00000222      129.06233215  [5336] Hello Debugger!
00000223      129.16236877  [5336] Hello Debugger!
00000224      129.26237488  [5336] Hello Debugger!
00000225      129.36238098  [5336] Hello Debugger!
00000226      129.46238708  [5336] Hello Debugger!
00000227      129.56239319  [5336] Hello Debugger!
00000228      129.66239929  [5336] Hello Debugger!
00000229      129.76242065  [5336] Hello Debugger!

Next time we tried to save a user dump as soon as Task Manager showed high CPU consumption:

The dump revealed the new thread:

0:000:x86> ~*kL

.  0  Id: d5c.1708 Suspend: 0 Teb: 7efdb000 Unfrozen
ChildEBP RetAddr  
0035f4ac 76c61270 ntdll_77170000!ZwWaitForSingleObject+0x15
0035f51c 76c611d8 kernel32!WaitForSingleObjectEx+0xbe
0035f530 76c81961 kernel32!WaitForSingleObject+0x12
0035f794 76cc9060 kernel32!OutputDebugStringA+0xef
0035f7b4 00a5106b kernel32!OutputDebugStringW+0x41
0035f7c4 00a511d6 MissingSpike!wmain+0x2b
0035f808 76cdeccb MissingSpike!__tmainCRTStartup+0x10f
0035f814 771ed24d kernel32!BaseThreadInitThunk+0xe
0035f854 771ed45f ntdll_77170000!__RtlUserThreadStart+0x23
0035f86c 00000000 ntdll_77170000!_RtlUserThreadStart+0x1b

1  Id: d5c.d58 Suspend: 0 Teb: 7efd8000 Unfrozen
ChildEBP RetAddr  
00c7f6c0 76c61270 ntdll_77170000!ZwWaitForSingleObject+0x15
00c7f730 76c611d8 kernel32!WaitForSingleObjectEx+0xbe
00c7f744 76c83204 kernel32!WaitForSingleObject+0x12
00c7f9a8 76cc9060 kernel32!OutputDebugStringA+0x1ba
00c7f9c8 00a51027 kernel32!OutputDebugStringW+0x41
00c7f9dc 6ed932a3 MissingSpike!spiking_thread+0x27
00c7fa14 6ed9332b msvcr90!_callthreadstart+0x1b
00c7fa1c 76cdeccb msvcr90!_threadstart+0x5d
00c7fa28 771ed24d kernel32!BaseThreadInitThunk+0xe
00c7fa68 771ed45f ntdll_77170000!__RtlUserThreadStart+0x23
00c7fa80 00000000 ntdll_77170000!_RtlUserThreadStart+0x1b

Now we see that was a CPU consumer:

0:000:x86> !runaway f
 User Mode Time
  Thread       Time
   1:d58       0 days 0:00:04.102
   0:1708      0 days 0:00:00.000
 Kernel Mode Time
  Thread       Time
   1:d58       0 days 0:00:06.598
   0:1708      0 days 0:00:00.031
 Elapsed Time
  Thread       Time
   0:1708      0 days 0:00:49.505
   1:d58       0 days 0:00:49.494

Less than a minute passed. After one minute the process becomes quiet from CPU consumption perspective:

Recorded DebugView trace has 246,100 lines in just one minute. Here is the fragment:

00245186      59.96057129 [3420] Hello Computation!
00245187      59.96061707 [3420] Hello Computation!
00245188      59.96066284 [3420] Hello Computation!
00245189      59.96070862 [3420] Hello Computation!
00245190      59.96075058 [3420] Hello Computation!
00245191      59.96082306 [3420] Hello Computation!
00245192      59.96085739 [3420] Hello Debugger!
00245193      59.96089554 [3420] Hello Computation!
00245194      59.96093750 [3420] Hello Computation!
00245195      59.96098328 [3420] Hello Computation!
00245196      59.96102905 [3420] Hello Computation!
00245197      59.96106720 [3420] Hello Computation!
00245198      59.96111298 [3420] Hello Computation!
00245199      59.96115494 [3420] Hello Computation!
00245200      59.96119690 [3420] Hello Computation!
00245201      59.96124268 [3420] Hello Computation!

The MissingSpike application and its symbols can be downloaded from:

http://www.dumpanalysis.org/Debugged/Sep09/download/MissingSpike.zip

[1] http://www.dumpanalysis.org/blog/index.php/2007/05/11/crash-dump-analysi...
[2] http://www.dumpanalysis.org/blog/index.php/2007/09/14/crash-dump-analysi...
[3] http://www.dumpanalysis.org/blog/index.php/2009/08/04/trace-analysis-pat...
[4] http://www.dumpanalysis.org/blog/index.php/2009/04/28/trace-analysis-pat...
[5] http://www.dumpanalysis.org/blog/index.php/2009/11/07/trace-analysis-pat...
[6] http://www.dumpanalysis.org/blog/index.php/2007/08/12/crash-dump-analysi...
[7] http://www.dumpanalysis.org/blog/index.php/2009/07/22/trace-analysis-pat...