Often when unexpected errors first occur, I've found problems in the way the error is handled. This isn't surprising: error handling code is often the least-exercised part of your system, so it's natural that there are bugs in it. It is important to fix the error handling problems first, then the original error.

Sometimes there is a whole chaing of problems started by the original error condition, then compounded as layers of error handling code mis-handle the condition. Fix all the problems, backwards.

Error handling code

In any program, there will be code specifically designed to handle unusual situations, whether they be errors, exceptions, failed assertions, edge conditions, odd input, whatever. The important thing about this "anomaly code" is that it is not often executed, making it a good hiding place for bugs. Often it can be difficult to test: how can you make sure there's a bad block in a disk file to test your error handling logic?

Estimates vary about what fraction of a large program is devoted to error handling, especially when different error handling techniques (status codes vs. exceptions, for instance) are taken into account. I think five to ten percent is a reasonable estimate in a large well-developed application. That's a good slice of code that doesn't get regularly tested.

Fix error handling first

When a problem occurs in your application, always check first that the error was handled appropriately. If it wasn't, always fix the handling code first.

Always fix error handling before fixing errors.

This rule is like the airline rule of putting an oxygen mask on yourself before putting one on your child. There are a few reasons for insisting on this order of work:

  1. With the original error in place, you have a perfect test case for the bug in your error handling code. Once you fix the original problem, how will you test the error handling? Remember, one of the reasons there was a bug there in the first place is that it is hard to test it.
  2. Once the original problem is fixed, the urgency for fixing the error handling code is gone. You can say you'll get to it, but what's the rush? You'll be like the guy with the leaky roof. When it's raining, he can't fix it because it's raining out, and when it isn't raining, there's no leak!

Cascading errors

In complex systems, we might be talking here about a whole chain of hidden bugs, all hidden until one problem triggers them, displaying them to you all at once. When that happens, analyze the chain, understand what went wrong at each step, and then fix the problems backwards.

Fix a chain of bugs backwards, from last observed problem back to the original cause.

This is the general statement of my original rule. In each cause-and-effect pair, don't fix the cause until the effect is correct. With a chain, the only effect you can fix first is the last one, because every other one is another effect's cause.

A real example

Here's an actual example that happened just the other day. The details have been changed for simplification (and to protect the innocent). Here's the log file:

1 [trace] Getting user id...
2 [crit!] Assertion failed: strlen(pUser->GetUserId()) > 0
3 [error] No handler for failed assertion
4 [trace] Prefs lookup: cache for  is empty
5 [trace] Querying full user info...
6 [trace] User  using  skin

The first thing wrong here is at line 2, where a critical error occurs: a failed assertion. The programmer thought it was impossible to have an empty string for a user id. He was wrong.

If it happens, it must be possible.

In the real bug this is based on, the cause was another server that was unavailable, but it doesn't matter: clearly somehow the user id was an empty string, so an assertion is the wrong way to deal with it. Bug 1.

At line 3, we see that the assertion infrastructure is not properly initialized, and there is no handler for failed assertions, so it is ignored. Bug 2.

At line 4, the preferences cache is being asked for information about the user, using the empty string as an id, and the cache is trying to find the user (see the double space in the phrase "cache for   is empty"? That's the empty user id in that message). The cache should know enough not to accept an empty string as a user id. Bug 3.

At line 5, the user's full information is queried, and at line 6, the results of the user query are back, and they aren't good. Not only has the user database gladly accepted an empty string as a user id, but it has returned empty strings back as the results. The log message here has two cases of double-space, showing the two places where empty strings are being used as words. Bug 4.

The wrong way to fix it

In this example, the simple thing to do is to find the server that was down, bring it back up and try again. The server will be available, the user id will be properly retrieved, it will find real results in the cache and the database, and all will be well. Right?

Wrong. You won't have fixed any bugs — our first bug wasn't that the server was down, it was that we mis-handled the server being down. The next time the server is unavailable, everything will go wrong again.

The right way to fix it

Work backwards, starting with the last problem. Fix the database so that it detects bogus ids, and returns some sort of error. Run the application again, and see that bugs 1 through 3 still happen, but number 4 has been fixed.

Work on bug 3: fix the cache so that it detects bogus ids, and returns some sort of error (perhaps throwing an exception). Run the app again, and see that bugs 1 and 2 still happen, but number 3 has been fixed. Notice that at this point, you can't tell if bug 4 is fixed or not: the bogus user id never gets to the database, so it never gets a chance to show that it can detect it. This is why we had to fix (and test!) bug 4 first. Bug 4 only happens when bugs 1 through 3 are still in the system.

Fix bug 2: make sure the assertion handler is properly initialized, or make sure that if it is not, something dramatic (like a dialog box or a thrown exception) happens.

Fix bug 1: don't assert that the user id can't be the empty string. Clearly it can. Throw an exception instead.

Now that all four of the bugs are fixed, you can do something about the real problem: the application didn't handle an unavailable server well. There's tons of ways to deal with this (retries, fail over, caches, redundant sources), but that's off the topic.

Putting it into practice

Working strictly backward like this can be difficult. For one thing, the four bugs could have ended up with different developers, and coordinating their efforts to be sure the bugs were fixed (and tested) in reverse order would have been too constraining. (In the real bugs my example was based on, the four bugs were handled by three different developers).

You may be under time pressure, and don't have the luxury of fixing all four bugs in turn. The boss is screaming for the system to start working again, and won't want to hear about some "theoretical" improvement. As far as he's concerned, everything's broken until the system is running again.

Some people would even argue that fixing all four bugs was unnecessary. Consider bug 4. Once the database client is fixed so that it never requests data about the empty user id, why does it matter what the database server does with an empty user id? So it's not a bug any more, right?

In the real world, there are all sorts of constraints and pressures (time, ownership, expertise, etc) that may convince you not to fix bugs. I understand that — we all have to ship product, and we're all only human. But the fact remains that bug 4 was a real bug. It may be latent, it may be hard to fix, it may be expedient not to fix it, but let's not kid ourselves: it is a bug. A bug you can't see is still a bug.

Fixing code is easiest when you can run it, and error handling code is only run when there are errors happening for it to handle. Broken error handling code is just a future mysterious accident waiting to happen. Fix error handling code before fixing errors.

See also

Comments

[gravatar]
Osiris 12:18 AM on 27 Oct 2003

Hmmmm...good point. One of those 'obvious once it has been mentioned' things that are called common sense and yet are so commonly overlooked.

[gravatar]
guy-in-corner 6:49 AM on 13 Nov 2003

Minor suggestion: In tracing frameworks, when printing a string, always put quotes around it.

e.g. 6 [trace] User '' using '' skin

This makes it much easier to tell that you've got an empty string, or if you've got leading or trailing spaces.

Maybe this should go in your Log message style guide?

[gravatar]
asdfasdf 12:49 AM on 12 May 2005

Bug number 4 is not necessarily a bug. Depends on the contract of the method. You shouldn't make EVERY method guard against EVERYTHING. Else you have too much redundancy.

[gravatar]
Pete Austin 4:28 AM on 23 Sep 2005

You missed Bug 5: the system is down. That's the one to fix first.

Later, at a quiet time, if you've really no alternative to doing development on your live system, you can fix Bugs 4 to 1.

[gravatar]
Ned Batchelder 6:49 AM on 23 Sep 2005

Pete, you are right, the server being down was the root cause. In a production environment, you would fix that problem first.

But in a development environment, it's the one to fix last. You need the server to stay down while you fix all the downstream problems it caused. That's the only way to know that you'll deal with a down server more gracefully next time.

[gravatar]
Andrew Grimm 7:45 PM on 19 Apr 2009

Rather than merely having a fix for bug 4 that's only testable until bug 3 is fixed (manually running the program), wouldn't it be better to have a test that involves knowingly sending the database a bad id, and testing that it returns an error? That way, you can be confident that any refactoring won't break the error handling.

Add a comment:

name
email
Ignore this:
not displayed and no spam.
Leave this empty:
www
not searched.
 
Name and either email or www are required.
Don't put anything here:
Leave this empty:
URLs auto-link and some tags are allowed: <a><b><i><p><br><pre>.