

Philip Hamer, 28th of December, 2010
phiham[0x40]hotmail[0x2E]com
In the following article I will share my own recent experience debugging a software crash. Along the way, I’ll share and promote what I think is a good way to reason about these types of bugs.
Implementing software hooks is a time-honored technique for extending the functionality of a third-party software component for which you have no source code. It is a fine addition to any clever programmer’s bag of tricks. Methods of implementing hooks range from well-known “safe” techniques to downright ugly-yet-elegant hacks. They run the gamut from window subclassing (via SetWindowLongPtr) to the use of SetWindowsHookEx; from import address table hooking to “detour” or “trampoline” functions that rely on overwriting the first several bytes of a function in memory with a “jmp” instruction to your hook.
As for the specific bug I was investigating, it involved an application that used the Internet Explorer web browser control. The hook, in this case, was done through a standard COM interface method call, specifically the IDispatchEx interface. The purpose of the hook was to get notification of when a certain DOM method was called by a JavaScript function. There were reports of crashes, though, during internal testing. After some investigation it was determined that the crash was reproducible only in certain environments and only while navigating to a few specific web sites. So I open WinDbg in a “bad” environment, attach to our process, and navigate to a “bad” web site. I’m greeted with this.
(bd4.d20): Access violation - code c0000005 (first chance)
First chance exceptions are reported before any exception handling.
This exception may be expected and handled.
eax=00000000 ebx=00000000 ecx=00000000 edx=00000020 esi=01bda674 edi=07aeb804
eip=3cecf011 esp=01bda640 ebp=01bda644 iopl=0 nv up ei pl zr na pe nc
cs=001b ss=0023 ds=0023 es=0023 fs=003b gs=0000 efl=00010246
mshtml!CChildIterator::CChildIterator+0x30:
3cecf011 8b4010 mov eax,dword ptr [eax+10h] ds:0023:00000010=????????
We can see this is a NULL pointer dereference in the CChildIterator constructor in the mshtml module.
OK, so it has been reproduced. Now I modify my code so that my hooks are never “hooked up”, and I repeat the exercise. This time: no crash. I make another modification and reinstall my hooks, except this time I remove all the code in my hook function. Again: no crash. What is going on? It seems that there is something about my hook function that causes a crash within mshtml as it tries to construct a “CChildIterator” object. Let’s investigate the crash some more.
After more debugging, I find out that the specific hook responsible is my hooking of the “createElement” method. That is, I get notified whenever script calls createElement, and I can do whatever I want right after that in my hook function before returning control to the browser. If I look at the stack trace, I can attempt to infer a few things, as well.
0:005> k n 0x20
 # ChildEBP RetAddr
00 01bda644 3d087128 mshtml!CChildIterator::CChildIterator+0x30
01 01bdc6c0 3cf8fd02 mshtml!CLinkElement::HandleLinkedObjects+0x23f
02 01bdc6d4 3cf78565 mshtml!CLinkElement::OnPropertyChange+0x91
03 01bdc730 3cf44ab9 mshtml!BASICPROPPARAMS::SetStringProperty+0x1d7
04 01bdc754 3cf8fd71 mshtml!BASICPROPPARAMS::SetUrlProperty+0x32
05 01bdc76c 3cf8fd44 mshtml!CBase::put_UrlHelper+0x24
06 01bdc788 3cf784b4 mshtml!CBase::put_Url+0x24
07 01bdc7b0 3cf586f8 mshtml!GS_BSTR+0x84
08 01bdc848 3cf74f63 mshtml!CBase::ContextInvokeEx+0x4ef
09 01bdc878 3d02f606 mshtml!CElement::ContextInvokeEx+0x70
0a 01bdc8ac 3ced638a mshtml!CLinkElement::ContextThunk_InvokeExReady+0x63
0b 01bdc8e0 3cf1adf0 mshtml!CBase::Invoke+0x6e
0c 01bdc97c 3cf1ad0d mshtml!CBase::setAttribute+0xc3
0d 01bdc9e4 3cf586f8 mshtml!Method_void_BSTR_VARIANT_oDoLONG+0xb8
0e 01bdca7c 3cf74f63 mshtml!CBase::ContextInvokeEx+0x4ef
0f 01bdcaac 3d02f606 mshtml!CElement::ContextInvokeEx+0x70
10 01bdcae0 75c729d7 mshtml!CLinkElement::ContextThunk_InvokeExReady+0x63
11 01bdcb18 75c72947 jscript!IDispatchExInvokeEx2+0xac
12 01bdcb50 75c731e5 jscript!IDispatchExInvokeEx+0x56
13 01bdcbc0 75c71c0a jscript!InvokeDispatchEx+0x78
14 01bdcc08 75c71211 jscript!VAR::InvokeByName+0xba
15 01bdcc48 75c711c6 jscript!VAR::InvokeDispName+0x43
16 01bdcc6c 75c7311d jscript!VAR::InvokeByDispID+0xfd
17 01bdcd24 75c71123 jscript!CScriptRuntime::Run+0x176c
18 01bdcd3c 75c70f8a jscript!ScrFncObj::Call+0x8d
19 01bdcdac 75c72642 jscript!CSession::Execute+0xa7
1a 01bdce9c 75c724fe jscript!NameTbl::InvokeDef+0x179
1b 01bdcf1c 75c729d7 jscript!NameTbl::InvokeEx+0xcb
1c 01bdcf54 75c72947 jscript!IDispatchExInvokeEx2+0xac
1d 01bdcf8c 75c58fa8 jscript!IDispatchExInvokeEx+0x56
1e 01bdd01c 3cf0466b jscript!NameTbl::InvokeEx+0x2c5
1f 01bdd058 3cf0454f mshtml!CScriptCollection::InvokeEx+0x8c
If we take “CLinkElement” to be the implementation of the element represented by the LINK tag, then we can assume that the crash occurs while a LINK element is accessed in script (line 0x10). Furthermore, it seems that a “setAttribute” method call is causing the crash (line 0x0c). In addition, it seems that the attribute being set is a URL (line 0x06). Finally, while updating its attribute to the given URL the HandleLinkedObjects method (line 0x01) wants to iterate through a collection of children (line 0x00) and possibly notify these children, as well. It is here that the access violation occurs because of the NULL pointer.
If I examine my hook function, I can see that I never call “setAttribute” on a LINK element, and furthermore a complete investigation of the call stack will reveal that my code is not on the stack anywhere. Thus it is a crash that occurs after my hook has been called. I eventually determine that the crash occurs when a snippet of JavaScript in the Yahoo! UI (YUI) library dynamically creates LINK elements and then assigns their “href” property.
This is all interesting, but I’m still not closer to figuring out what causes the crash. It seems that the crash only occurs when specifically a LINK element is created via the createElement function. So I could “fix” the bug by adding code to my hook function to check whether the element created is a LINK element, and if so then do nothing and return. This is often the kind of thing you have to do if you want to ship software. But still it is not very satisfying. I want software problems to make sense, and this one is taunting me. I don’t like it.
Let’s step back and think about the general process of hooking. In setting up a hook, one needs to 1) write code to insert the hook into the “call chain”; write the actual hook function, which 2) handles whatever event we are interested in and also 3) calls the next function in the chain; and (possibly) 4) write code to remove the hook. I think that it is easy to jump to the conclusion that a crash must be caused by #1, #3, or #4. A lot of times when you’re writing hooks, it starts to feel like you’re not a proper software engineering citizen. If your hooking process involves swapping pointers, then you start to feel a little dirty and assume the worst. If your hook setup involves manually jmp-ing your instruction pointer to an address of your choosing and/or Virtual[Un]Protect-ing memory, then you really start to feel sick.
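To make the four parts concrete, here is a minimal JavaScript sketch of a hook on a method of an ordinary object. It is not the IDispatchEx hook described in this article; the names installHook and fakeDocument are hypothetical, and the object is a stand-in rather than a real DOM. The handler runs right after the original call returns, which mirrors how the createElement hook is described above.

```javascript
// Hypothetical helper: wraps target[name] so onCall is notified after each call.
function installHook(target, name, onCall) {
  const original = target[name]; // remember the next function in the chain

  function hooked(...args) {
    const result = original.apply(this, args); // (3) call the next function in the chain
    onCall(result, args);                      // (2) handle the event we are interested in
    return result;
  }

  target[name] = hooked; // (1) insert the hook into the "call chain"

  // (4) code to remove the hook; only restores if nobody re-hooked on top of us
  return function removeHook() {
    if (target[name] === hooked) {
      target[name] = original;
    }
  };
}

// Stand-in object for illustration only; this is not a real DOM document.
const fakeDocument = {
  createElement(tag) {
    return { tagName: tag.toUpperCase() };
  },
};

const seen = [];
const removeHook = installHook(fakeDocument, "createElement", (el) => {
  seen.push(el.tagName);
});

fakeDocument.createElement("link"); // observed by the hook
removeHook();
fakeDocument.createElement("div");  // not observed, the hook is gone

console.log(seen); // -> ["LINK"]
```

In this sketch parts #1, #3, and #4 are pure plumbing, while part #2 is the only place where the hook does anything that the hooked component would not have done by itself.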
Of course, jumping to conclusions is not a great long-term strategy when debugging. It stands to reason that if we can first eliminate #2 as the culprit, then we can properly start blaming the hooking machinery itself (#1, #3, or #4) and go from there. In my investigation above, it seems that #2 is where the fault lies. But it still doesn’t quite make sense, and since the crash occurs at a point after my hook function has returned, I don’t have solid evidence yet.
Notice that I originally showed that
3rd party code = no crash
and
3rd party code + my hook = crash
If we break “my hook” into its four constituent parts, we can rewrite the second equation as
3rd party code + myhook1 + myhook2 + myhook3 + myhook4 = crash
Then I showed that by subtracting myhook2 (when I commented out the actual code in my hook function), there was no crash. Or
3rd party code + myhook1 + myhook3 + myhook4 = no crash
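The bookkeeping behind these three equations can be sketched in a few lines of JavaScript. This is only an illustration of the elimination step: the part names follow the equations above, the suspects function is a hypothetical helper, and it considers each part as a single cause without accounting for interactions between parts.

```javascript
// The four parts of the hook, named as in the equations above.
const PARTS = ["myhook1", "myhook2", "myhook3", "myhook4"];

// One entry per experiment described so far, with the observed outcome.
const experiments = [
  { parts: [], crashed: false },                                          // 3rd party code alone
  { parts: ["myhook1", "myhook2", "myhook3", "myhook4"], crashed: true }, // full hook
  { parts: ["myhook1", "myhook3", "myhook4"], crashed: false },           // hook body removed
];

// A part remains a suspect if it was present in every crashing run
// and absent from every run that did not crash.
function suspects(parts, runs) {
  return parts.filter((part) =>
    runs.every((run) => run.parts.includes(part) === run.crashed)
  );
}

console.log(suspects(PARTS, experiments)); // -> ["myhook2"]
```

Elimination of this kind narrows the list of suspects, but it never runs myhook2 without the other three parts present.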
It would be more powerful, though, to add myhook2 to the first equation, and show that
3rd party code + myhook2 = crash
While this may not always be possible in every scenario, I believe I can try it here. My hook function manipulates the created element through mshtml interfaces, so if I can create an equivalent bit of JavaScript and have that same manipulation done outside of my code, then I can simply load the HTML/JS in the browser and look for a crash. If I find a crash and I see the same call stack in WinDbg, then the bug is not in my code. It’s great to be able to say this, but furthermore I’ll have the smoking gun!
It didn’t take me long to accomplish this. The first thing I do in my hook function is access the document of the element that was just created. This JavaScript code, therefore, will crash Internet Explorer for certain versions of mshtml. (It’s true: the execution of line 2 causes the crash in line 4!)
var li = document.createElement("LINK");
var dc = li.document;
li.setAttribute("rel", "stylesheet");
li.setAttribute("href", "foo.css");
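When running a reproduction like this, it helps to pair it with a control that differs only in the suspect line. The sketch below wraps the same four statements in a hypothetical buildLink function with a flag for the document access. The stub document exists only so the sketch can run outside a browser; in an affected browser one would pass the real document instead, expecting the flag to be the only difference between a crash and no crash.

```javascript
// Hypothetical harness: `doc` is any object that offers createElement.
function buildLink(doc, touchDocument) {
  const li = doc.createElement("LINK");
  if (touchDocument) {
    void li.document; // the single access under suspicion (line 2 of the repro)
  }
  li.setAttribute("rel", "stylesheet");
  li.setAttribute("href", "foo.css");
  return li;
}

// Stub for running outside a browser; it only records what was touched.
function makeStubDocument(log) {
  return {
    createElement() {
      const el = {
        setAttribute(name, value) {
          log.push("set " + name + "=" + value);
        },
      };
      Object.defineProperty(el, "document", {
        get() {
          log.push("get document");
          return null;
        },
      });
      return el;
    },
  };
}

const controlLog = [];
buildLink(makeStubDocument(controlLog), false);   // control: no document access

const experimentLog = [];
buildLink(makeStubDocument(experimentLog), true); // experiment: adds the one access

console.log(controlLog);    // -> ["set rel=stylesheet", "set href=foo.css"]
console.log(experimentLog); // -> ["get document", "set rel=stylesheet", "set href=foo.css"]
```

Keeping the two runs identical except for one statement is what lets the result be read as evidence about that statement.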
I hope this provides you with a few lessons learned. First, investigating bugs can be fun and interesting, and you probably think so too since you’re reading this magazine. Second, when you modify software and see buggy behavior, the root cause may not be you. Finally, you need to verify this before pointing fingers in any direction. I have presented here a framework for thinking about how to do just that.
About the Author
Philip Hamer has 10 years of experience developing and debugging applications on Windows platforms. He is currently a senior developer for a large tech company.