Software Hooks: Best Practices for Debugging

Philip Hamer, 28th of December, 2010
phiham[0x40]hotmail[0x2E]com

In the following article I will share my own recent experience debugging a software crash. Along the way, I’ll share and promote what I think is a good way to reason about these types of bugs.

Implementing software hooks is a time-honored technique for extending the functionality of a third-party software component for which you have no source code. It is a fine addition to any clever programmer’s bag of tricks. Methods of implementing hooks range from well-known “safe” techniques to downright ugly-yet-elegant hacks. They run the gamut from window subclassing (via SetWindowLongPtr) to the use of SetWindowsHookEx; from import address table hooking to “detour” or “trampoline” functions that rely on overwriting the first several bytes of a function in memory with a “jmp” instruction to your hook.

As for the specific bug I was investigating, it involved an application that used the Internet Explorer web browser control. The hook, in this case, was implemented through method calls on a standard COM interface, IDispatchEx. The purpose of the hook was to be notified when a certain DOM method was called by a JavaScript function. During internal testing, though, there were reports of crashes. After some investigation it was determined that the crash was reproducible only in certain environments and only while navigating to a few specific web sites. So I open WinDbg in a “bad” environment, attach to our process, and navigate to a “bad” web site. I’m greeted with this:

(bd4.d20): Access violation - code c0000005 (first chance)
First chance exceptions are reported before any exception handling.
This exception may be expected and handled.
eax=00000000 ebx=00000000 ecx=00000000 edx=00000020 esi=01bda674 edi=07aeb804
eip=3cecf011 esp=01bda640 ebp=01bda644 iopl=0         nv up ei pl zr na pe nc
cs=001b  ss=0023  ds=0023  es=0023  fs=003b  gs=0000             efl=00010246
mshtml!CChildIterator::CChildIterator+0x30:
3cecf011 8b4010          mov     eax,dword ptr [eax+10h] ds:0023:00000010=????????

We can see this is a NULL pointer dereference in the CChildIterator constructor in the mshtml module: eax is zero, and the faulting instruction reads from [eax+10h].

OK, so it has been reproduced. Now I modify my code so that my hooks are never “hooked up”, and I repeat the exercise. This time: no crash. I make another modification and reinstall my hooks, except this time I remove all the code in my hook function. Again: no crash. What is going on? It seems that there is something about my hook function that causes a crash within mshtml as it tries to construct a “CChildIterator” object. Let’s investigate the crash some more.

After more debugging, I find out that the specific hook responsible is my hooking of the “createElement” method. That is, I get notified whenever script calls createElement, and I can do whatever I want right after that in my hook function before returning control to the browser. If I look at the stack trace, I can attempt to infer a few things, as well.

0:005> k n 0x20
 # ChildEBP RetAddr  
00 01bda644 3d087128 mshtml!CChildIterator::CChildIterator+0x30
01 01bdc6c0 3cf8fd02 mshtml!CLinkElement::HandleLinkedObjects+0x23f
02 01bdc6d4 3cf78565 mshtml!CLinkElement::OnPropertyChange+0x91
03 01bdc730 3cf44ab9 mshtml!BASICPROPPARAMS::SetStringProperty+0x1d7
04 01bdc754 3cf8fd71 mshtml!BASICPROPPARAMS::SetUrlProperty+0x32
05 01bdc76c 3cf8fd44 mshtml!CBase::put_UrlHelper+0x24
06 01bdc788 3cf784b4 mshtml!CBase::put_Url+0x24
07 01bdc7b0 3cf586f8 mshtml!GS_BSTR+0x84
08 01bdc848 3cf74f63 mshtml!CBase::ContextInvokeEx+0x4ef
09 01bdc878 3d02f606 mshtml!CElement::ContextInvokeEx+0x70
0a 01bdc8ac 3ced638a mshtml!CLinkElement::ContextThunk_InvokeExReady+0x63
0b 01bdc8e0 3cf1adf0 mshtml!CBase::Invoke+0x6e
0c 01bdc97c 3cf1ad0d mshtml!CBase::setAttribute+0xc3
0d 01bdc9e4 3cf586f8 mshtml!Method_void_BSTR_VARIANT_oDoLONG+0xb8
0e 01bdca7c 3cf74f63 mshtml!CBase::ContextInvokeEx+0x4ef
0f 01bdcaac 3d02f606 mshtml!CElement::ContextInvokeEx+0x70
10 01bdcae0 75c729d7 mshtml!CLinkElement::ContextThunk_InvokeExReady+0x63
11 01bdcb18 75c72947 jscript!IDispatchExInvokeEx2+0xac
12 01bdcb50 75c731e5 jscript!IDispatchExInvokeEx+0x56
13 01bdcbc0 75c71c0a jscript!InvokeDispatchEx+0x78
14 01bdcc08 75c71211 jscript!VAR::InvokeByName+0xba
15 01bdcc48 75c711c6 jscript!VAR::InvokeDispName+0x43
16 01bdcc6c 75c7311d jscript!VAR::InvokeByDispID+0xfd
17 01bdcd24 75c71123 jscript!CScriptRuntime::Run+0x176c
18 01bdcd3c 75c70f8a jscript!ScrFncObj::Call+0x8d
19 01bdcdac 75c72642 jscript!CSession::Execute+0xa7
1a 01bdce9c 75c724fe jscript!NameTbl::InvokeDef+0x179
1b 01bdcf1c 75c729d7 jscript!NameTbl::InvokeEx+0xcb
1c 01bdcf54 75c72947 jscript!IDispatchExInvokeEx2+0xac
1d 01bdcf8c 75c58fa8 jscript!IDispatchExInvokeEx+0x56
1e 01bdd01c 3cf0466b jscript!NameTbl::InvokeEx+0x2c5
1f 01bdd058 3cf0454f mshtml!CScriptCollection::InvokeEx+0x8c

If we take “CLinkElement” to be the implementation of the element represented by the LINK tag, then we can assume that the crash occurs while a LINK element is accessed in script (frame 0x10). Furthermore, it seems that a “setAttribute” method call is causing the crash (frame 0x0c). In addition, it seems that the attribute being set is a URL (frame 0x06). Finally, while updating its attribute to the given URL, the HandleLinkedObjects method (frame 0x01) wants to iterate through a collection of children (frame 0x00) and possibly notify these children as well. It is here that the access violation occurs because of the NULL pointer.

If I examine my hook function, I can see that I never call “setAttribute” on a LINK element, and furthermore a complete investigation of the call stack will reveal that my code is not on the stack anywhere. Thus it is a crash that occurs after my hook has been called. I eventually determine that the crash occurs when a snippet of JavaScript in the Yahoo! UI (YUI) library dynamically creates LINK elements and then assigns their “href” property.

This is all interesting, but I’m still no closer to figuring out what causes the crash. It seems that the crash occurs only when a LINK element, specifically, is created via the createElement function. So I could “fix” the bug by adding code to my hook function to check whether the element created is a LINK element, and if so, do nothing and return. This is often the kind of thing you have to do if you want to ship software. But still it is not very satisfying. I want software problems to make sense, and this one is taunting me. I don’t like it.
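A minimal sketch of that workaround follows. It is written in JavaScript purely for illustration; the real hook lives in native code behind IDispatchEx. The names onElementCreated and inspectElement are hypothetical stand-ins and do not come from the actual hook.

```javascript
// Hypothetical stand-in for whatever the real hook does with a new element.
var inspected = [];
function inspectElement(element) {
  inspected.push(element.tagName);
}

// Hypothetical hook body, called right after createElement returns.
function onElementCreated(element) {
  // Workaround: leave LINK elements completely untouched.
  if (String(element.tagName).toUpperCase() === "LINK") {
    return;
  }
  inspectElement(element);
}

onElementCreated({ tagName: "DIV" });   // inspected as usual
onElementCreated({ tagName: "link" });  // skipped by the guard
```

The guard hides the symptom without explaining it, which is exactly why it is unsatisfying.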

Let’s step back and think about the general process of hooking. In setting up a hook, one needs to 1) write code to insert the hook into the “call chain”, 2) write the actual hook function, which handles whatever event we are interested in and also 3) calls the next function in the chain, and (possibly) 4) write code to remove the hook. I think that it is easy to jump to the conclusion that a crash must be caused by either #1, #3, or #4. A lot of times when you’re writing hooks, it starts to feel like you’re not a proper software engineering citizen. If your hooking process involves swapping pointers, then you start to feel a little dirty and assume the worst. If your hook setup involves manually jmp-ing your instruction pointer to an address of your choosing and/or VirtualProtect-ing memory, then you really start to feel sick.
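The four parts can be sketched by wrapping a method on an object. This is only an illustration in JavaScript, not the IDispatchEx hook from this investigation; thirdParty, installHook, removeHook and onCreateElement are all hypothetical names.

```javascript
// Hypothetical third-party object; stands in for code we cannot modify.
var thirdParty = {
  createElement: function (tagName) {
    return { tagName: tagName.toUpperCase(), attributes: {} };
  }
};

var seen = [];        // what the hook observed (for illustration only)
var original = null;  // the next function in the call chain

// Part 2: handle the event we are interested in.
function onCreateElement(element) {
  seen.push(element.tagName);
}

// Part 1: insert the hook into the call chain.
function installHook() {
  original = thirdParty.createElement;
  thirdParty.createElement = function (tagName) {
    // Part 3: call the next function in the chain.
    var element = original.call(this, tagName);
    onCreateElement(element);
    return element;
  };
}

// Part 4: remove the hook.
function removeHook() {
  if (original !== null) {
    thirdParty.createElement = original;
    original = null;
  }
}

installHook();
var link = thirdParty.createElement("link");  // observed by the hook
removeHook();
var div = thirdParty.createElement("div");    // no longer observed
```

Keeping part 2 in its own function makes it easy to empty out or replace that part alone while leaving the plumbing of parts 1, 3 and 4 in place.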

Of course, jumping to conclusions is not a great long-term strategy when debugging. It stands to reason that if we can first eliminate #2 as the culprit, then we can properly start blaming our hook and go from there. In my investigation above, it seems that #2 is where the fault lies. But it still doesn’t quite make sense, and since the crash occurs at a point after my hook function has returned, I don’t have solid evidence yet.

Notice that I originally showed that

3rd party code = no crash

and

3rd party code + my hook = crash

If we break “my hook” into its four constituent parts, we can rewrite the second equation as

3rd party code + myhook1 + myhook2 + myhook3 + myhook4 = crash

Then I showed that by subtracting myhook2 (when I commented out the actual code in my hook function), there was no crash. Or

3rd party code + myhook1 + myhook3 + myhook4 = no crash

It would be more powerful, though, to add myhook2 to the first equation, and show that

3rd party code + myhook2 = crash

While this may not be possible in every scenario, I believe I can try it here. My hook function manipulates the created element through mshtml interfaces, so if I can create an equivalent bit of JavaScript and have that same manipulation done outside of my code, then I can simply load the HTML/JS in the browser and look for a crash. If I find a crash and I see the same call stack in WinDbg, then the bug is not in my code. It’s great to be able to say this, but furthermore I’ll have the smoking gun!

It didn’t take me long to accomplish this. The first thing I do in my hook function is access the document of the element that was just created. This JavaScript code, therefore, will crash Internet Explorer for certain versions of mshtml. (It’s true: the execution of line 2 causes the crash in line 4!)

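var li = document.createElement("LINK");
var dc = li.document;                  // line 2: merely reading the element's document...
li.setAttribute("rel", "stylesheet");
li.setAttribute("href", "foo.css");    // line 4: ...leads to the crash here on affected mshtml versions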
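The four experiments from this investigation can also be laid out as a small table. The sketch below is only a way of organizing the reasoning: the crashes function is a simulated stand-in for "load the page and watch the debugger", and it simply encodes the observed outcome that myhook2 alone is enough to trigger the crash.

```javascript
// Simulated oracle (an assumption for illustration): in practice each run
// is a manual reproduction attempt under the debugger.
function crashes(parts) {
  return parts.indexOf("myhook2") !== -1;
}

var experiments = [
  { name: "3rd party code", parts: [] },
  { name: "3rd party code + myhook1 + myhook2 + myhook3 + myhook4",
    parts: ["myhook1", "myhook2", "myhook3", "myhook4"] },
  { name: "3rd party code + myhook1 + myhook3 + myhook4",
    parts: ["myhook1", "myhook3", "myhook4"] },
  { name: "3rd party code + myhook2", parts: ["myhook2"] }
];

// Run every configuration and record whether it crashed.
var results = experiments.map(function (e) {
  return { name: e.name, crash: crashes(e.parts) };
});

results.forEach(function (r) {
  console.log(r.name + " = " + (r.crash ? "crash" : "no crash"));
});
```

The third row (subtraction) only shows that myhook2 is necessary for the crash; the fourth row (addition) shows that it is sufficient without any of the hook plumbing.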

I hope this provides you with a few lessons learned. First, investigating bugs can be fun and interesting, and you probably think so too since you’re reading this magazine. Second, when you modify software and see buggy behavior, the root cause may not be you. Finally, you need to verify this before pointing fingers in any direction. I have presented here a framework for thinking about how to do just that.

About the Author

Philip Hamer has 10 years of experience developing and debugging applications on Windows platforms. He is currently a senior developer for a large tech company.