Debugging the Debugger

Jason Copenhaver, 2nd of August, 2011
Winner of Tell Your Windows Debugging Story Annual Competition
jcopenhaver[0x40]accusoft[0x2E]com

Our company has an internal library that allows teams in our company to package DLLs as resources inside a single DLL. This means our customers only have to worry about distributing a single file, and we are sure that the DLL we deliver to the customer always runs with the DLLs that we tested with. One of the teams using this library was running into a problem where Visual Studio 2008 was crashing when they were debugging their 64-bit product. It’s never good when you are doing something so wrong that your debugger crashes. Needing to figure out what was going on I launched Visual Studio 2008 (VS), attached a copy of WinDbg and then repeated the steps in VS to make it crash.

The crash occurred exactly as reported in the defect ticket.

(564.1298): Break instruction exception - code 80000003 (first chance)
eax=00000000 ebx=00000000 ecx=7712e1fb edx=030bc5e9 esi=00270000 edi=08c11030
eip=77100004 esp=030bc838 ebp=030bc8b4 iopl=0         nv up ei pl nz na pe nc
cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00000206
ntdll!DbgBreakPoint:
77100004 cc              int     3

A quick look at the stack and I had my first suspect.

0:001> k
ChildEBP RetAddr  
030bc834 7719bd9e ntdll!DbgBreakPoint
030bc8b4 7719ca95 ntdll!RtlReportCriticalFailure+0x2e
030bc8c4 7719cb83 ntdll!RtlpReportHeapFailure+0x21
030bc8f8 77143498 ntdll!RtlpLogHeapFailure+0xa1
030bc924 764a3f8b ntdll!RtlFreeHeap+0x60
030bc938 72a03c1b kernel32!HeapFree+0x14
030bc984 6a113a14 MSVCR90!free+0xcd [f:\dd\vctools\crt_bld\self_x86\crt\src\free.c @ 110] ← 
← NatDbgDE (the VS debugger)is probably freeing something it shouldn’t be.
030bc9a0 6a0ecc9d NatDbgDE!LoadPdata+0x142
030bcc3c 6a0bd3ee NatDbgDE!LoadCVFile+0x1a4
030bd094 6a0bd52d NatDbgDE!AfterPDBOpen+0xef
030bdd10 6a0ba2ce NatDbgDE!OLStart+0x11e
030bdd4c 6a0ba36e NatDbgDE!LoadOmfForReal+0x23
030bdd64 6a0ba3d9 NatDbgDE!LoadSymbols+0x43
030bdd7c 6a0ba499 NatDbgDE!OLLoadOmf+0x55
030bddac 6a0b9dc1 NatDbgDE!SHLoadDll+0xd5
030bddfc 6a0b9cfa NatDbgDE!CSymbolHandlerX::SHLoadDll+0x5a
030bde94 6a0b9685 NatDbgDE!CModule::Load+0x1a1
030bdefc 6a0ab717 NatDbgDE!CNativeProcess::NotifyModLoad+0xc8
030be03c 6a0ab682 NatDbgDE!EngineCallback+0xb3
030be068 6a0b959d NatDbgDE!EMCallBackDB+0x4c

The Microsoft debugging team has put a lot of work into the heuristics based !analyze command in WinDbg so I always like to see what it thinks of a crash before I go too far.

0:001> !analyze -v
*******************************************************************************
*                                                                             *
*                        Exception Analysis                                   *
*                                                                             *
*******************************************************************************

FAULTING_IP:
ntdll!DbgBreakPoint+0
77100004 cc              int     3

EXCEPTION_RECORD:  ffffffff -- (.exr 0xffffffffffffffff)
.exr 0xffffffffffffffff
ExceptionAddress: 77100004 (ntdll!DbgBreakPoint)
  ExceptionCode: 80000003 (Break instruction exception)
 ExceptionFlags: 00000000
NumberParameters: 1
  Parameter[0]: 00000000

FAULTING_THREAD:  00001298
PROCESS_NAME:  devenv.exe
ERROR_CODE: (NTSTATUS) 0x80000003 - {EXCEPTION}  Breakpoint  A breakpoint has been reached.
EXCEPTION_CODE: (HRESULT) 0x80000003 (2147483651) - One or more arguments are invalid
EXCEPTION_PARAMETER1:  00000000
NTGLOBALFLAG:  400

APPLICATION_VERIFIER_FLAGS:  0

…

FOLLOWUP_IP:
NatDbgDE!LoadPdata+142
6a113a14 59              pop     ecx

SYMBOL_STACK_INDEX:  3

SYMBOL_NAME:  NatDbgDE!LoadPdata+142
FOLLOWUP_NAME:  MachineOwner
MODULE_NAME: NatDbgDE
IMAGE_NAME:  NatDbgDE.dll
DEBUG_FLR_IMAGE_TIMESTAMP:  488f2cca
FAILURE_BUCKET_ID:  ACTIONABLE_HEAP_CORRUPTION_heap_failure_block_not_busy_80000003_
NatDbgDE.dll!LoadPdata

BUCKET_ID:  APPLICATION_FAULT_ACTIONABLE_HEAP_CORRUPTION_heap_failure_block_not_busy_
DOUBLE_FREE_NatDbgDE!LoadPdata+142

...

Followup: MachineOwner
---------

Well look at that “APPLICATION_FAULT_ACTIONABLE_HEAP_CORRUPTION_heap_failure_block_not_busy_DOUBLE_FREE_NatDbgDE!LoadPdata+142” the CRT is nice enough to point us to the code that did a double free. Looking at the stack trace with function names like SHLoadDll, AfterPdbOpen, and LoadPdata led me to believe this code was trying to parse a DLL. The .pdata section of the PE32+ format contains a table of IMAGE_RUNTIME_FUNCTION_ENTRY structures. This describes the Begin and End address of each function and the location of the stack unwind information needed to handle exceptions. Now all we have to do is understand NatDbgDE!LoadPdata well enough to see why it was double freeing some memory in this case.

Looking at the disassembly we can see that it allocates a buffer (.text:45BC394B), saves it to the stack (.text:45BC3952) and then passes it to a function call (.text:45BC3963) to read in the .pdata section.

.text:45BC391A                 mov     eax, edi              ; copy size to eax
.text:45BC391C                 div     ecx
.text:45BC391E                 mov     dword ptr [ebx+8], 1
.text:45BC3925                 mov     ecx, esi
.text:45BC3927                 mov     [ebp+arg_0], eax      ; store size on stack
...
.text:45BC394B                 call    ds:__imp__malloc      ; allocate buffer
.text:45BC3951                 pop     ecx
.text:45BC3952                 mov     [ebp+arg_10], eax     ; save buffer
.text:45BC3955                 test    eax, eax
.text:45BC3957                 jnz     short loc_45BC395D
...
.text:45BC395D                 mov     edx, [esi]
.text:45BC395F                 push    edi
.text:45BC3960                 push    eax                  ; pass buffer
.text:45BC3961                 mov     ecx, esi
.text:45BC3963                 call    dword ptr [edx+10h]  ; fill in the buffer

It then clears edi (.text:45BC396C), save it on the stack (.text:45BC3979), followed by inspecting the buffer that our .pdata section was copied into (.text:45BC3984).

.text:45BC396C                 xor     edi, edi             ; clear edi
.text:45BC396E                 call    ?SHGetPlatform@@YG?AW4_MPT@@PAUHPDS__@@@Z
.text:45BC3973                 dec     eax
.text:45BC3974                 jz      short loc_45BC39A0
.text:45BC3976                 dec     eax
.text:45BC3977                 jnz     short loc_45BC3942
.text:45BC3979                 cmp     [ebp+arg_0], edi     ; save edi
.text:45BC397C                 jbe     short loc_45BC39E4
.text:45BC397E                 mov     ecx, [ebp+arg_10]    ; get our buffer
.text:45BC3981                 add     ecx, 8               ; buffer+8
.text:45BC3984
.text:45BC3984 loc_45BC3984:
.text:45BC3984                 cmp     dword ptr [ecx-8], 0 ; Is BeginAddress 0 ?
.text:45BC3988                 jz      short loc_45BC39C5   ; get out of here

If the first value of the .pdata section, the BeginAddress, is 0 it jumps to some cleanup code.

.text:45BC39C5                 cmp     edi, [ebp+arg_0]     ; compare current edi
.text:45BC39C8                 jnb     short loc_45BC39E4   ; with buffer size
.text:45BC39CA                 mov     [ebp+arg_0], edi     ; save edi
.text:45BC39CD                 imul    edi, [ebp+var_4]     ; edi*sizeof(FuncEntry)
.text:45BC39D1                 push    edi             
.text:45BC39D2                 push    [ebp+arg_10]    
.text:45BC39D5                 call    ds:__imp__realloc    ; realloc our buffer
.text:45BC39DB                 pop     ecx
.text:45BC39DC                 pop     ecx
.text:45BC39DD                 test    eax, eax
.text:45BC39DF                 jz      short loc_45BC3A0B

This last bit of code is where the error occurs. It reallocs the buffer to edi*sizeof(IMAGE_RUNTIME_FUNCTION_ENTRY) but since it errored out on the very first entry edi is still 0. The important part from the realloc documentation says “If size is zero, then the block pointed to by memblock is freed; the return value is NULL, and memblock is left pointing at a freed block.” So this code has just freed our buffer and returned NULL, so the test of eax at .text:45BC39DD will be zero and the code jumps to .text:45BC3A0B.

.text:45BC3A0B                 push    [ebp+arg_10]      ; our original buffer
.text:45BC3A0E                 call    ds:__imp__free    ; LoadPdata+142 fault
.text:45BC3A14                 pop     ecx

Which immediately calls free and it faults.

Tracing through the NatDbgDE!LoadPdata function we see that a double free can occur if we attempt to parse a .pdata section that starts with 0. Then the question becomes why do we have a .pdata section that starts with 0 and what can we do about it?

When we unpack the embedded DLL we write a temporary DLL to disk with just enough information to trick the OS into allocating the correct amount of memory and then load that DLL. This gets us a correct chunk of memory that we can then copy the real contents of the DLL into, doing all of the fix-ups, and even some function level hooks. Loading an actual DLL allows us to get useful DLL_THREAD_ATTACH, DLL_THREAD_DETACH, etc. messages from the OS. When building this temporary DLL we start from zero’d memory and copy only the parts that the OS needs to do the load. So we create a .pdata section that we’ve now learned is invalid for the debugger, but that is not invalid for the OS loader. We considered removing the .pdata section from the temporary DLL but then the OS can’t handle exceptions properly as it has no exception unwind information. The simple fix was to make sure we wrote at least one IMAGE_RUNTIME_FUNCTION_ENTRY to the .pdata section and that made Visual Studio happy. I’ve also added realloc size checks to my code review checklist to ensure this type of problem doesn’t happen in my code.