Wednesday 17 September 2008 — This is more than 16 years old. Be careful.
At work we upgraded to the shiny-new Django 1.0, and we had to make a lot of small changes in the process. Most were what you would expect: adapting to the 1.0 way from the older 0.96 code we had been using.
But some of them were undoing ad-hoc patches to Django that we had accreted over the two years we’d been banging away at it. Over the course of a week or so, we’d found dozens of things broken, pointing to work yet to be done to finish the 1.0 upgrade, just as you’d expect. We have a large code base, and many things changed between 0.96 and 1.0.
Yesterday, I couldn’t log in on my dev server. Everyone else had been working just fine for the last few days, so it seemed mysterious. I asked our main Django guy Dave for help, and together we logged some session information, saw that there was no session being established at all. He realized what the problem was. “Oh, I changed SESSION_COOKIE_DOMAIN back to a string, we don’t use the list any more.” Turns out it was one of our ad-hoc Django changes that we threw overboard, and my settings file still had the old setting in it.
This is where the software should have diagnosed itself. If the settings/main.py file had these two lines added to it:
if isinstance(SESSION_COOKIE_DOMAIN, list):
raise Exception("SESSION_COOKIE_DOMAIN should be just a string now.")
Then I would have immediately gotten an exception on my server console (and browser) pointing to precisely what the problem was. I could have fixed it, and been running in two minutes, rather than being frustrated for half and hour, and bother Dave for another ten minutes.
Our development team is small (five), and all sit next to each other most days of the week, so the cost of this sort of out of band communication about changes to infrastructure is small. Also, I seem to have been the only developer who had a list in their settings file. So perhaps the cost here was a total of about an hour. Not so much, but adding those two lines in the first place would have cost about five minutes. And in addition to the five developers, there are probably five other “development environments” floating around for other purposes: intern work, demos, backups, evaluation tarballs sent to other groups, etc, and who knows if those will have the same problem.
And besides the simple time spent, there’s the loss of focus, the distraction of the other developers, the frustration, and so on. Developer attention is a very valuable resource. A speed bump like this in the road is like a CPU cache miss: your pipelines are flushed, and you have to re-focus. The time taken doesn’t tell the whole story.
Yesterday was just one of those days, because later, I was entering a zipcode into my dev machine, and was consistently told that there were no facilities near that zipcode, even though I knew there should be.
Turns out that somehow, my database table of zipcodes was empty. We still don’t know how that happened, but it would have been great if the software could have helped diagnose this anomalous condition. I changed this:
try:
z = ZipCode.objects.get(pk=zipcode)
except ZipCode.DoesNotExist:
raise KeyError
to this:
try:
z = ZipCode.objects.get(pk=zipcode)
except ZipCode.DoesNotExist:
if settings.DEBUG:
# Sometimes the problem isn't one bad zipcode, but that there
# are no zipcodes in the db at all!
if ZipCode.objects.all().count() == 0:
print "*** You have no zipcodes! Run bin/load_zipcodes.py"
raise KeyError
It would have been another half-hour saved. I don’t know how the zipcodes were deleted, so it’s hard to guess how often someone will be in this position again, but I know it is worth it to add these sorts of diagnostics. I’ll take a guess that the next time the zipcodes are missing will be five minutes before a critical demo, when everyone is panicky and no one will be able to think through the possible causes clearly. An unambiguous diagnostic will be very welcome.
Take the time to make your software self-diagnosing. The more you can automate about the job of writing software, the better your software will be.
Comments
after all migrating to a different release is supposed to break lots of things but happens so rarely that is simply not worth to add lots of diagnostic code that will never get triggered
it may also be that you need better testing rather than self diagnosis
1. it tends to clutter up our codebase. Your first example added 2 lines (no idea how big the other code is), your second adds 2 lines to 3 that were already there (to be fair, I'll skip the comments in my calculation). And as we know, more code = more errors.
2. in the general case we can not test for all possible failure conditions. So we will only diagnose the "known" error conditions. Worse: The more diagnostics we add, the more failure conditions can arise.
For an example, in your first code snippet, your test is wrong: You should test for somehting being no string, not something being a list. By the way, static typing would have solved this easily (Even optional static typing might have caught this).
You might consider that this would have been covered by a basic 'smoke test'. That is, a basic system test that attempts to run a simple use case and exercises a significant part of the functionality. Any tests are far better than none.
The other option is concise syntax. Somewhere you have a "Required_Records", "All_Meet_Regex" and other constraints so that you concisely describe the preconditions to running the code.
Code that has its own integral diagnostics should be limited to code that can take action on those diagnostics.
(or to get radically lazy, designate some tests as 'canonical' and automatically report when the size of any list or table is much different than in that test. sounds crazy, but it might have other uses too, eg automatic detection of tables that should be read in pieces because the user is really just waiting for the first 20 rows to fill a list in the gui.)
but if you cant change the architecture today, then maybe what you wrote is pretty ideal because it's ugly enough to act as a reminder to future developers that bigger changes are needed. perhaps adding an 'architectural todo' tag would make this wart more findable when/if the time comes.
(on the other hand, i cant begin to count the number of times i've written a good 'todo' comment, then decided to actually do it and found that doing it didnt take much longer than writing the todo! not that the todo time was wasted--it probably helped to clarify what i needed to do.)
I don't know Django that well, but try to handle these cases as follows:
Write upgraders that register themselves for specific target versions, and have as many ugly conditionals as you need in there. Try to write upgraders that will not break stuff but pass silently when encountering already upgraded content, and you're golden.
The upgraders also tend to become an interesting view on how things progress over time. ;)
I'm not advocating writing diagnostic code for every possible failure condition. I understand that that is an impossible undertaking. But the very real failure cases that I encountered are potential potholes for other developers as well, and diagnostics to help them avoid baffling failures are a good thing.
Yes, these lines of code add complexity, but it's complexity that is worth it. Too often a failure happens, and the developer leans out of his cube to ask around what might be the cause, and another dev says, "Oh yeah, you have to fix it like this, I knew about that". This is waste, and if it can be avoided with two or three lines of code, it is well worth it.
It's all well and good to say that every line of code is a potential source of problems, but that doesn't mean we should write lines of code. We just have to write them with good purpose in mind.
There could have been better places to add the ZipCode check, certainly, and maybe I should have thought harder about where to put it.
My point was more that the specific codebase changes you just made were (hopefully) a one time change. Having code in your application 'forever' to check whether that change was successfully applied seems wasteful to me.
On the ZipCode, are you also going to check that all items have only digits, not alphas? That all are size=5? That the table exist? Of course not. That is what design-by-contract is for. Either you have something in the architecture to handle it, or you spend time every now and then when the things go unexplicably wrong.
There's no upgrade script that would have handled this. I said we don't know how the zipcodes got deleted. Should I write an upgrade script that checks for all possible inexplicable deletions of data? The fact is, this failure happened, and cost me time. I'll gladly pay three lines of code to guarantee it doesn't happen again.
I'm not advocating defensive programming against all imaginable failures. I'm talking about observing failures, and then ensuring that those specific failures don't happen again.
Ned: I have a problem, here is the solution, you should all do this.
Crowd: No thanks, your solution sucks, doesn't really solve the problem, which you don't really understand anyway, and it leads to even worse problems.
Ned: I disagree, my way is best, I don't care about the downside.
Without knowing much about the actual subject at hand here, I'll agree with the crowd. Ned, if you don't even know why the zipcodes disappeared, how can you strongly advocate a particular fix? And: how do you know that you won't find the the potential increased errors, when you inevitably encounter them, to be even more bothersome than your current problems?
SAFTI*,
Paul
*Smug advice from the ignorant
The first problem (list was converted to string) would have been handled by an upgrade script which would open old data and store it in new format.
The second (ZipCode) problem of unexplicable disappearing ZipCodes would not be fixed by an upgrade script, and would probably be missed by endless checks.
I refered to the ZipCodes problem when saying that either an architectural solution is needed, or one developers spends one hour fishing for the solution every couple of months. I understand your frustration, but spending one extra hour is not the worst thing in the world.
Hope that this clears up what I meant to say.
@Zoran: the problem with lists was not in the database, or in a data file. It was in a settings.py file. We have no infrastructure for automatically editing those files, and I'm not sure I'd want one. And which files would it look for? I have a few private settings files that are not checked into source control because they are particular to my machine. If you think a two-line check is error-prone and hard to maintain, what will you think of an upgrading infrastructure that modifies private source files on developers' machines? It sounds like a nightmare to me.
You may not mind spending an hour every few months. I don't mind carrying two lines of code to avoid it.
I had 'SESSION_COOKIE_DOMAIN = ( A, B, C )' in my settings file, your 'fix' would not have caught my problem.. Though I feel your pain. In a perfect world, Django would have taken a tip from the greater python world and had a staging release with deprecation warnings. This of course would mean that that release was a little bit slower and clunkier, but would help with the migration with more useful hints. The counter argument is that this was for getting to the 1.0 release and deprecation warnings will be used going forward now that a proper baseline has been established.
In general I have to agree with all the masses. Have proper tests for your zipcode stuff. I am a firm believer is stupidity driven testing. When something like this comes up, add a test for it. Have a problem? run the tests and see what shakes out. If the tests don't fail, then you have 2 bugs (the second that there is no test for this bug). This keeps the code maintainable. That is not to say that intelligent error reporting should not be used, just that in general defensive programming on the level you are describing here is a very very bad practice.
Speaking as a woodworker, there are a few practices that I use that my 'betters' would not, and I really should stop doing them; especially when it comes to furniture (but drywall screws are cheap and available in so many useful lengths).
Add a comment: