

Sathish Venkataraman, 21st of July, 2010
Winner of Tell Your Windows Debugging Story Annual Competition
This was one of the tough problems we faced a few years back when I was with Computer Associates Inc., NY.
One of our customers was running our application (as a Windows service) on more than 700 servers at different locations in the US. They always scheduled a reboot of all 700+ machines at almost the same time every Sunday at noon. When they did that, our service on about half a dozen of these servers would go into a weird state where it stopped responding to service requests, although restarting the service fixed it. They couldn’t turn on logging on all the servers because it produced too much log content, and we couldn’t give them binaries with additional tracing because they would have had to deploy them on all the servers. The default log didn’t help; a higher tracing level would have. But the tracing switch couldn’t be changed dynamically, because the code didn’t re-read it until the service restarted.
The customer definitely didn't like the idea of disturbing 700+ servers with an instrumented binary just to troubleshoot this issue. Another hurdle was that the customer didn't have access to GoToMyPC- or WebEx-like services in their production environment, so we had to talk their systems personnel through every step over the phone. After fighting with this issue for weeks, we decided on the following plan:
About the Author
Sathish Venkataraman has 17 years of total experience and is currently working with FaceTime Communications India Pvt Ltd., Bangalore, India.