Self-diagnosing software

Wednesday 17 September 2008

At work we upgraded to the shiny-new Django 1.0, and we had to make a lot of small changes in the process. Most were what you would expect: adapting to the 1.0 way from the older 0.96 code we had been using.

But some of them were undoing ad-hoc patches to Django that we had accreted over the two years we’d been banging away at it. Over the course of a week or so, we’d found dozens of things broken, pointing to work yet to be done to finish the 1.0 upgrade, just as you’d expect. We have a large code base, and many things changed between 0.96 and 1.0.

Yesterday, I couldn’t log in on my dev server. Everyone else had been working just fine for the last few days, so it seemed mysterious. I asked our main Django guy Dave for help, and together we logged some session information, saw that there was no session being established at all. He realized what the problem was. “Oh, I changed SESSION_COOKIE_DOMAIN back to a string, we don’t use the list any more.” Turns out it was one of our ad-hoc Django changes that we threw overboard, and my settings file still had the old setting in it.

This is where the software should have diagnosed itself. If the settings/main.py file had these two lines added to it:

if isinstance(SESSION_COOKIE_DOMAIN, list):
    raise Exception("SESSION_COOKIE_DOMAIN should be just a string now.")

Then I would have immediately gotten an exception on my server console (and browser) pointing to precisely what the problem was. I could have fixed it, and been running in two minutes, rather than being frustrated for half and hour, and bother Dave for another ten minutes.

Our development team is small (five), and all sit next to each other most days of the week, so the cost of this sort of out of band communication about changes to infrastructure is small. Also, I seem to have been the only developer who had a list in their settings file. So perhaps the cost here was a total of about an hour. Not so much, but adding those two lines in the first place would have cost about five minutes. And in addition to the five developers, there are probably five other “development environments” floating around for other purposes: intern work, demos, backups, evaluation tarballs sent to other groups, etc, and who knows if those will have the same problem.

And besides the simple time spent, there’s the loss of focus, the distraction of the other developers, the frustration, and so on. Developer attention is a very valuable resource. A speed bump like this in the road is like a CPU cache miss: your pipelines are flushed, and you have to re-focus. The time taken doesn’t tell the whole story.

Yesterday was just one of those days, because later, I was entering a zipcode into my dev machine, and was consistently told that there were no facilities near that zipcode, even though I knew there should be.

Turns out that somehow, my database table of zipcodes was empty. We still don’t know how that happened, but it would have been great if the software could have helped diagnose this anomalous condition. I changed this:

try:
    z = ZipCode.objects.get(pk=zipcode)
except ZipCode.DoesNotExist:
    raise KeyError

to this:

try:
    z = ZipCode.objects.get(pk=zipcode)
except ZipCode.DoesNotExist:
    if settings.DEBUG:
        # Sometimes the problem isn't one bad zipcode, but that there
        # are no zipcodes in the db at all!
        if ZipCode.objects.all().count() == 0:
            print "*** You have no zipcodes! Run bin/load_zipcodes.py"
    raise KeyError

It would have been another half-hour saved. I don’t know how the zipcodes were deleted, so it’s hard to guess how often someone will be in this position again, but I know it is worth it to add these sorts of diagnostics. I’ll take a guess that the next time the zipcodes are missing will be five minutes before a critical demo, when everyone is panicky and no one will be able to think through the possible causes clearly. An unambiguous diagnostic will be very welcome.

Take the time to make your software self-diagnosing. The more you can automate about the job of writing software, the better your software will be.

Comments

[gravatar]
It's a Van 11:10 AM on 17 Sep 2008

okay but if you were to add "self-diagnostic" to every feature that can go wrong you'd end up with a pretty ugly code

after all migrating to a different release is supposed to break lots of things but happens so rarely that is simply not worth to add lots of diagnostic code that will never get triggered

it may also be that you need better testing rather than self diagnosis

[gravatar]
Andre 11:36 AM on 17 Sep 2008

IMHO there are two problems with diagnostic code:
1. it tends to clutter up our codebase. Your first example added 2 lines (no idea how big the other code is), your second adds 2 lines to 3 that were already there (to be fair, I'll skip the comments in my calculation). And as we know, more code = more errors.

2. in the general case we can not test for all possible failure conditions. So we will only diagnose the "known" error conditions. Worse: The more diagnostics we add, the more failure conditions can arise.

For an example, in your first code snippet, your test is wrong: You should test for somehting being no string, not something being a list. By the way, static typing would have solved this easily (Even optional static typing might have caught this).

[gravatar]
Charles Merriam 12:07 PM on 17 Sep 2008

My rule has always been "every token or documentation string adds an equal maintenance load". Your diagnostic adds about 22 tokens to your main code which will need to be correct and touched when the zip code interface changes. Also, you are writing the diagnostic for that one problem, which with the benefit of hindsight, is the exact case you thought of. If zip codes were suddenly the wrong length, or non-digits, or stored with the wrong "dash in zip+4" convention, you would need more diagnostics. Most practitioners put this in testing routines.

You might consider that this would have been covered by a basic 'smoke test'. That is, a basic system test that attempts to run a simple use case and exercises a significant part of the functionality. Any tests are far better than none.

The other option is concise syntax. Somewhere you have a "Required_Records", "All_Meet_Regex" and other constraints so that you concisely describe the preconditions to running the code.

Code that has its own integral diagnostics should be limited to code that can take action on those diagnostics.

[gravatar]
johnS 12:44 PM on 17 Sep 2008

sounds like the pain you went thru put you in a very different place from van and andre. there is something to address here, but i agree with them that it doesnt belong where you put it. i think it belongs in the broader architecture, which for example might include the notion that some tables should never be empty. your ZipCode could perhaps inherit from some NeverEmptyTable (maybe it's really just a mixin or even lighter) and perhaps the table is checked only upon first instantiation of the class.

(or to get radically lazy, designate some tests as 'canonical' and automatically report when the size of any list or table is much different than in that test. sounds crazy, but it might have other uses too, eg automatic detection of tables that should be read in pieces because the user is really just waiting for the first 20 rows to fill a list in the gui.)

but if you cant change the architecture today, then maybe what you wrote is pretty ideal because it's ugly enough to act as a reminder to future developers that bigger changes are needed. perhaps adding an 'architectural todo' tag would make this wart more findable when/if the time comes.

(on the other hand, i cant begin to count the number of times i've written a good 'todo' comment, then decided to actually do it and found that doing it didnt take much longer than writing the todo! not that the todo time was wasted--it probably helped to clarify what i needed to do.)

[gravatar]
eric casteleijn 12:47 PM on 17 Sep 2008

Having had to maintain some code bases that are older than the typical web application, I can only say: don't do this, you'll thank yourself later. Having defensive code like this all over the place makes it less clear what the code is supposed to do.

I don't know Django that well, but try to handle these cases as follows:

Write upgraders that register themselves for specific target versions, and have as many ugly conditionals as you need in there. Try to write upgraders that will not break stuff but pass silently when encountering already upgraded content, and you're golden.

The upgraders also tend to become an interesting view on how things progress over time. ;)

[gravatar]
Ned Batchelder 1:13 PM on 17 Sep 2008

Well, looks like I'm the minority here!

I'm not advocating writing diagnostic code for every possible failure condition. I understand that that is an impossible undertaking. But the very real failure cases that I encountered are potential potholes for other developers as well, and diagnostics to help them avoid baffling failures are a good thing.

Yes, these lines of code add complexity, but it's complexity that is worth it. Too often a failure happens, and the developer leans out of his cube to ask around what might be the cause, and another dev says, "Oh yeah, you have to fix it like this, I knew about that". This is waste, and if it can be avoided with two or three lines of code, it is well worth it.

It's all well and good to say that every line of code is a potential source of problems, but that doesn't mean we should write lines of code. We just have to write them with good purpose in mind.

There could have been better places to add the ZipCode check, certainly, and maybe I should have thought harder about where to put it.

[gravatar]
eric casteleijn 1:26 PM on 17 Sep 2008

You can't always avoid defensive code, and the examples you give are certainly better than most I've encountered, in that they still raise an error, just a more informative one.

My point was more that the specific codebase changes you just made were (hopefully) a one time change. Having code in your application 'forever' to check whether that change was successfully applied seems wasteful to me.

[gravatar]
Charles Merriam 1:34 PM on 17 Sep 2008

The best place to put the zip code check is in your test suite. In a perfect world you have a test that says:

   ReadAddressCheck Test Failed
         assert KeyError "No Zip Code"
         
         Previous Programmers Preach:
             If zipcode database is missing, run bin/load_zipcodes.py
             Transistion to unicode causes old zips to break; reload.
             Run bin/valid_data -f zipcode to see if zipcodes match regex
        

[gravatar]
Marius Gedminas 2:59 PM on 17 Sep 2008

Interesting. I completely agree with Ned's point here, but after seeing those code examples my first reaction also was "no! don't do that!". I'm unable to justify that reaction with any arguments.

[gravatar]
Fredrik 7:19 AM on 18 Sep 2008

I tend to use a separate "sanity" check test suite for cases like this (built using my usual unit test tools). The sanity checks look for common build & installation problems, configuration issues, basic database sanity, and all kinds of silly but hard-to-find issues that appear over a project's life time (including memory leaks and other bugs in third-party software), and are designed to be run both as part of the usual test suites, *and* to verify that a new installation is sane. Has saved my own sanity many times.

[gravatar]
Zoran Lazarevic 12:42 PM on 18 Sep 2008

I was about to write "don't do this", but others have already stated that. I would say that your team needs to write upgrade scripts, that will correct the environment left over by the previous build. That would have converted your list to string.
On the ZipCode, are you also going to check that all items have only digits, not alphas? That all are size=5? That the table exist? Of course not. That is what design-by-contract is for. Either you have something in the architecture to handle it, or you spend time every now and then when the things go unexplicably wrong.

[gravatar]
Ned Batchelder 2:10 PM on 18 Sep 2008

@Zoran, you argue that I can't cover all the possible failure modes, and you are of course right. But then you say I should cover this failure in an upgrade script. You still have the problem of not being able to cover all the failure cases.

There's no upgrade script that would have handled this. I said we don't know how the zipcodes got deleted. Should I write an upgrade script that checks for all possible inexplicable deletions of data? The fact is, this failure happened, and cost me time. I'll gladly pay three lines of code to guarantee it doesn't happen again.

I'm not advocating defensive programming against all imaginable failures. I'm talking about observing failures, and then ensuring that those specific failures don't happen again.

[gravatar]
Paul Downs 3:55 PM on 18 Sep 2008

Wisdom of crowds vs. Ned.

Ned: I have a problem, here is the solution, you should all do this.

Crowd: No thanks, your solution sucks, doesn't really solve the problem, which you don't really understand anyway, and it leads to even worse problems.

Ned: I disagree, my way is best, I don't care about the downside.

Without knowing much about the actual subject at hand here, I'll agree with the crowd. Ned, if you don't even know why the zipcodes disappeared, how can you strongly advocate a particular fix? And: how do you know that you won't find the the potential increased errors, when you inevitably encounter them, to be even more bothersome than your current problems?

SAFTI*,

Paul

*Smug advice from the ignorant

[gravatar]
Zoran Lazarevic 6:09 PM on 18 Sep 2008

Ned, you had two problems:
The first problem (list was converted to string) would have been handled by an upgrade script which would open old data and store it in new format.
The second (ZipCode) problem of unexplicable disappearing ZipCodes would not be fixed by an upgrade script, and would probably be missed by endless checks.
I refered to the ZipCodes problem when saying that either an architectural solution is needed, or one developers spends one hour fishing for the solution every couple of months. I understand your frustration, but spending one extra hour is not the worst thing in the world.
Hope that this clears up what I meant to say.

[gravatar]
Ned Batchelder 6:36 PM on 18 Sep 2008

@Paul: you're a craftsman (for those who don't know, Paul makes furniture). Is there anything you do differently than your peers? Why?

@Zoran: the problem with lists was not in the database, or in a data file. It was in a settings.py file. We have no infrastructure for automatically editing those files, and I'm not sure I'd want one. And which files would it look for? I have a few private settings files that are not checked into source control because they are particular to my machine. If you think a two-line check is error-prone and hard to maintain, what will you think of an upgrading infrastructure that modifies private source files on developers' machines? It sounds like a nightmare to me.

You may not mind spending an hour every few months. I don't mind carrying two lines of code to avoid it.

[gravatar]
Doug Napoleone 7:46 PM on 18 Sep 2008

Ned,

I had 'SESSION_COOKIE_DOMAIN = ( A, B, C )' in my settings file, your 'fix' would not have caught my problem.. Though I feel your pain. In a perfect world, Django would have taken a tip from the greater python world and had a staging release with deprecation warnings. This of course would mean that that release was a little bit slower and clunkier, but would help with the migration with more useful hints. The counter argument is that this was for getting to the 1.0 release and deprecation warnings will be used going forward now that a proper baseline has been established.


In general I have to agree with all the masses. Have proper tests for your zipcode stuff. I am a firm believer is stupidity driven testing. When something like this comes up, add a test for it. Have a problem? run the tests and see what shakes out. If the tests don't fail, then you have 2 bugs (the second that there is no test for this bug). This keeps the code maintainable. That is not to say that intelligent error reporting should not be used, just that in general defensive programming on the level you are describing here is a very very bad practice.

Speaking as a woodworker, there are a few practices that I use that my 'betters' would not, and I really should stop doing them; especially when it comes to furniture (but drywall screws are cheap and available in so many useful lengths).

[gravatar]
Paul Downs 9:14 AM on 19 Sep 2008

Ned: you know me. My preference is always to think a problem through and come up with my own solutions. This is often quite helpful, but one thing that I am paying greater attention to as I get older is that there is a huge cost associated with this - if all my competitors are doing something, it's smart to think why that might be. Non-standard approaches to standard problems may be technically elegant, but often have a short term development and/or implementation cost associated with them that the market isn't willing to bear, even if it leads to greater efficiencies in the future. I have been caught in this trap many times: making a fabulous product that clients aren't willing to pay for, because they don't value the upgrade the same way I do.

Add a comment:

Ignore this:
Leave this empty:
Name is required. Either email or web are required. Email won't be displayed and I won't spam you. Your web site won't be indexed by search engines.
Don't put anything here:
Leave this empty:
URLs auto-link and some tags are allowed: <a><b><i><p><br><pre>.