

Dan Skwire, 6th of November, 2010
Author of First Fault Software Problem Solving
IT IS AN EMERGENCY!
It's an emergency! No, not at the data center, but YOU feel terrible. Physically. In the USA, Canada, and other places, you dial 911. Other countries have their own short telephone numbers.
So you dial, and you get someone on the other end who is trained to ask you the fewest questions to find out exactly what you need, and get you in touch with an expert (or in places like France, a real doctor) who can help you remotely, or guide and inform an emergency group of paramedics in a well-equipped vehicle to your precise location.
SUCCESS IN THAT FIRST HOUR!
Once the paramedics arrive, they quickly use on-board instruments to diagnose your condition, inform the destination hospital of that condition, and perhaps get additional second-level medical help on how to treat or further diagnose your situation, which can improve your care and your likelihood of returning to physical health. In a 2004 article, Reader's Digest described a very efficient and successful 911 operation in the US city of Baltimore, with 8 helicopters airborne at all times, ready to transport people to the emergency room. The 10 emergency rooms at the destination hospitals are all identically equipped, with the same equipment and instruments arrayed the same way, so that all medical personnel can perform their duties successfully and efficiently in any of the 10 rooms. Their goal is to sustain the 97% survival rate they've attained by performing their duties efficiently in the first, critical hour of support. They call it their ‘golden hour’.
WHAT DOES THIS HAVE TO DO WITH COMPUTER PROBLEM SOLVING?
Well, great, Dan. Who cares about medical support? What relationship does that have to computer problem support, the field you're such an expert in? PLENTY!
IMPROVED OVER THOUSANDS OF YEARS
The medical field has paid great attention to problem resolution, if I may call it that, and to performing it both rapidly and accurately. My own recent hospital experience, and that of my wife, highlighted for me a very efficient problem-solving organization. And medical people do not depend on 'do-overs' (problem recreation), do they? The hospital staff depends on expert problem diagnosis, with expert analysis via very effective hardware. They use instruments to obtain vital signs, have great statistics regarding the meaning of those vital signs, and immediately hook up patients to real-time, continuous monitoring of them.
BEST PROACTIVE PRACTICES FOR GOOD HEALTH
Proactively, they recommend best practices to prevent problems (exercise, diet, and regular physical measurements of the state of your body - blood tests, diagnostic tests like colonoscopy, breast self-examination, and regular physical checkups).
COMPUTER BEST PRACTICES
Back when computer equipment was largely mechanical, 'preventive maintenance' was a common term in the computer field. It is still relevant today: keep software maintenance up to date, and perform health checks on the status of particular system components.
SOFTWARE BEST PRACTICES
In the computer field, we need to perform proactive monitoring: real-time tracking of CPU, I/O, storage, and network performance 'vital signs'. We are far better placed to solve performance problems if we track those vital signs continuously, starting before a performance problem appears. This data collection makes back-tracking and forensic research much easier, and both are valuable tools in uncovering the root of a performance issue.
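To make the idea of tracking vital signs before a problem concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the class name VitalSignHistory, the metric name "cpu_busy_pct", and the sample values are all made up for this example, and a real site would feed it from whatever performance monitor it already runs.

```python
import time
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    timestamp: float  # seconds since the epoch
    name: str         # hypothetical metric name, e.g. "cpu_busy_pct"
    value: float


class VitalSignHistory:
    """Keeps the most recent samples so they can be reviewed after a problem."""

    def __init__(self, max_samples=10_000):
        # A bounded deque discards the oldest sample once it is full.
        self._samples = deque(maxlen=max_samples)

    def record(self, name, value, timestamp=None):
        ts = time.time() if timestamp is None else timestamp
        self._samples.append(Sample(ts, name, value))

    def window(self, name, start, end):
        """Return samples of one vital sign taken between start and end, inclusive."""
        return [s for s in self._samples
                if s.name == name and start <= s.timestamp <= end]


# Invented readings: three normal samples, then the problem begins.
history = VitalSignHistory()
for t, busy in enumerate([20.0, 22.0, 21.0, 95.0, 97.0]):
    history.record("cpu_busy_pct", busy, timestamp=float(t))

# Back-track: what did the patient look like before the problem?
before_problem = history.window("cpu_busy_pct", 0.0, 2.0)
print([s.value for s in before_problem])
```

The point of the bounded history is that collection can run all the time at a fixed memory cost, so the 'before' picture is already on hand when the problem shows up.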
FIRST SIGNS OF COMPUTER PROBLEMS – MONITORING ‘VITAL SIGNS’
We benefit from real-time monitoring of exceptions to normal conditions: Tivoli alerts, and other real-time 'phone-home' notifications sent by telephone or over the internet.
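As a rough sketch of what an 'exception to normal conditions' check might look like, the Python below compares a new reading against a baseline of earlier readings. Everything here is an assumption for illustration: the function names, the three-standard-deviation tolerance, and the notify callback, which stands in for whatever alerting product or phone-home channel a site actually uses. It does not represent how Tivoli or any other product works.

```python
from statistics import mean, stdev


def is_exception(baseline, reading, tolerance=3.0):
    """True if reading lies more than `tolerance` standard deviations from the baseline mean."""
    if len(baseline) < 2:
        return False  # not enough history to judge what 'normal' is
    centre = mean(baseline)
    spread = stdev(baseline)
    if spread == 0:
        return reading != centre
    return abs(reading - centre) > tolerance * spread


def check_and_notify(baseline, reading, notify):
    """Call notify(message) when the reading is abnormal; return whether it was."""
    if is_exception(baseline, reading):
        notify(f"vital sign out of range: {reading!r}")
        return True
    return False


# Invented I/O-wait percentages observed while the system was healthy.
normal_io_wait = [4.0, 5.0, 6.0, 5.0, 4.0, 6.0]

alerts = []  # stand-in for a real notification channel
check_and_notify(normal_io_wait, 5.5, alerts.append)   # within the normal range
check_and_notify(normal_io_wait, 40.0, alerts.append)  # far outside it
print(alerts)
```

Passing the notification channel in as a callback keeps the check itself independent of how the alert is delivered.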
FITTING PEOPLE WITH INSTRUMENTS – 'INSTRUMENTING' PEOPLE?!
How about 'instrumenting' people with specific monitors as they go about their daily business, when they are not in a hospital setting? Yes, some people do have that, often with some known but unsolved condition. Should we instrument computers, especially computer software, with ‘data and condition’ collection, taking vital signs, so as to be able to reconstruct a problem, without having to do a 'do-over' (problem recreation)?
CAN YOU INSTRUMENT A COMPUTER OR SOFTWARE, BEFORE A PROBLEM OCCURS?
Yes, we can and should do that - we have a variety of monitors and activity trace tables, so we can perform a microscopic trace of program flow, in case there is some kind of problem. And often, substantial tracing can be done non-disruptively, without adversely affecting system performance. You KNOW you're going to have a problem sometime. Why not use computer software to collect diagnostic information, BEFORE the problem occurs? Afterward, it’s often too late to collect data, because by then the vital information describing the root cause of the problem has already been overlaid. Why not collect that data, the state of the programs, and important variables, just like you (likely already) collect performance data, before your problem occurs?
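One common way to build such an activity trace table is a small in-memory ring buffer that is always on and is snapshotted at the first sign of a fault. The Python below is a minimal sketch of that idea, not a description of any particular product: the names TraceTable, apply_discount, and run_with_first_fault_capture are hypothetical, as is the tiny capacity of 4 entries chosen to keep the example readable.

```python
import time
from collections import deque


class TraceTable:
    """Fixed-size, always-on trace: the newest entries overwrite the oldest."""

    def __init__(self, capacity=1024):
        self._entries = deque(maxlen=capacity)

    def trace(self, event, **state):
        # Record when it happened, what happened, and the variables of interest.
        self._entries.append((time.monotonic(), event, dict(state)))

    def dump(self):
        """Snapshot the table so later activity cannot overlay the evidence."""
        return list(self._entries)


TRACE = TraceTable(capacity=4)


def apply_discount(price, percent):
    TRACE.trace("apply_discount.enter", price=price, percent=percent)
    result = price * (100 - percent) / 100
    TRACE.trace("apply_discount.exit", result=result)
    return result


def run_with_first_fault_capture(func, *args):
    """Return (result, None) on success, or (None, trace snapshot) on the first fault."""
    try:
        return func(*args), None
    except Exception:
        return None, TRACE.dump()


ok, _ = run_with_first_fault_capture(apply_discount, 200.0, 10)
bad, evidence = run_with_first_fault_capture(apply_discount, None, 10)

# The last trace entry shows the state on entry to the failing call.
print(ok, evidence[-1][1], evidence[-1][2])
```

Because the table has a fixed size, tracing costs a bounded amount of memory, and the failing call's inputs are captured on the first occurrence, with no 'do-over' needed.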
WHAT HAPPENS AT YOUR COMPUTER SITE? CAN YOU DO THIS ‘AT HOME’?!
How about you? Does your site anticipate computer problems, build and activate instruments in case of a problem, like medical facilities with permanent 'emergency services' staffs and facilities? Or, do you anticipate nothing, and just hope that there are no problems, and then assemble what I used to call a 'pick-up' team when there's an emergency? In this scenario, your site quickly calls together experts, forms an organization, and plans the data you need to collect, ONLY when a problem occurs.
DIAGNOSE QUICKLY AND ACCURATELY, LIKE EMERGENCY MEDICAL PERSONNEL
Why not do it like the medical field, with trained emergency diagnostic experts who:
WHAT DOES AN OUTSIDER SEE WHEN YOUR ORGANIZATION WORKS A CRITICAL PROBLEM?
Does your computer staff look like a world-class medical emergency team, saving lives, or are you forced to handle emergencies 'on-the-fly', looking not so professional in the process?
RECOMMENDATION: INSTRUMENT YOUR PRODUCTS BEFORE PROBLEMS OCCUR
And please, do build your software with ongoing instrumentation, to collect data that your problem-solving technicians say they always need. Capture that data, and send it back via real-time telemetry. Fragments of that data will surely help match an error condition with a prior instance of the same problem (problem matching via signature) and it will undoubtedly be of value to help solve a new and unique problem. This is data that you will ultimately need anyway. Why wait to collect it until the early minutes or hours of a real-time outage?
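Problem matching via signature can be sketched in a few lines of Python. This is one possible approach under assumptions of my own choosing, not a standard: the signature is a hash of the error type, the message with digits masked out, and the names of the routines involved. The function names, the sample messages, and the "problem-A" style record labels are all invented for illustration.

```python
import hashlib
import re


def signature(error_type, message, frames):
    """Build a short, stable fingerprint for one failure."""
    # Mask runs of digits so that ids, counts, and addresses
    # do not make every occurrence look unique.
    normalized = re.sub(r"\d+", "N", message)
    text = "|".join([error_type, normalized, *frames])
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]


def match_or_register(known, sig, new_record):
    """Return the record already filed under sig, or file new_record under it."""
    return known.setdefault(sig, new_record)


known = {}
frames = ["read_block", "flush_cache"]  # hypothetical routine names

first = signature("TimeoutError", "timeout after 30 s on node 7", frames)
second = signature("TimeoutError", "timeout after 45 s on node 12", frames)

print(match_or_register(known, first, "problem-A"))   # new problem, filed
print(match_or_register(known, second, "problem-B"))  # matches the first
```

Masking the digits is a deliberate trade-off: it lets repeat occurrences of one defect match each other, at the risk of occasionally merging two problems that differ only in a number.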
LET’S CAPITALIZE ON THE MEDICAL COMMUNITY’S TECHNIQUES!
In short, the computer community is not the first community to need to solve complex problems when they first occur. We have been preceded by very successful medical communities of diagnostic experts, who have perfected their skills over thousands of years. Why not build communities of computer problem solvers according to their model? And surely, one thing we can do far more readily than medical people can is ‘instrument’ our patients (computer programs and computer systems) to collect diagnostic data in anticipation of problems that we, too, know will occur.