Database schema upgrades

Sunday 18 December 2005This is 19 years old. Be careful.

One of the trickier aspects of building applications on relational databases is upgrading the database schema. As features get added to the application, columns and tables get added to the database. At some point, you want to install the new code onto an existing installation. How do you change the database schema to add the columns and tables without losing the existing data?

In my last two startups, I was put in charge of figuring this out. The two designs I came up with had some differences, because there were different requirements and constraints, but they had these aspects in common:

  • Revisions of the database schema were assigned serial numbers by the developers who made changes.
  • The developer had to manually write code (an update clause) to change the database schema, though there were helpers to isolate him from as many messy details as possible.
  • The database recorded which schema number it currently had.
  • When started, the application would compare the current schema number in the database with the current schema number in the code. If they differed, it would find and run the proper update clauses in sequence to bring the database up to date.

We’ll have this same problem in my next job, though this time it will be on top of the Django framework. The Django dudes understand the issues (the first three comments on the Model Reference documentation page are about automatically upgrading databases), but may not tackle it for 1.0.

There have been at least a few attempts at solutions, but I think they are heading down the “too automatic” path.

What I found doing this in the past in real applications was that database upgrades are too complicated to be done automatically. The simple cases (adding a nullable column, dropping a column) can be made very simple. For example, the developer writes an upgrade clause that includes a call to a DropColumn function. But the complicated cases are always going to be beyond the reach of automatic generation. Real upgrades include things like taking an existing text column and splitting it into rows in a new foreign-keyed table, or adding a non-nullable column with existing values being computed from other tables. Upgrades also include stuff like realizing your application had been writing invalid data for a while, and running update statements to fix the bad rows.

The real world is messy. There’s just no way to leave the developer out of the solution.

Comments

[gravatar]
Interesting Ned. I think most dbs need to go through these schema evolutions over time, and its often tricky and error prone. I'm guessing one-off, custom dbs have this problem much more often than commerical products.

I think a good solution might be to forget relational dbs altogether and go with a schema-less Db system. Of course, since I'm writing my own schema-less db system (CouchDb), perhaps I'm slightly biased.
[gravatar]
We discussed that briefly on the django IRC channel last night, and in the comments to the same Django model API page I found a pointer to this article, Managing Database Changes, which details a sistem like the one you described, but for WebObjects.
[gravatar]
One of the things I've always wanted is a SQL diff tool, so I can generate the update statements automatically. In practice, what we end up doing at work is every time a change goes into code which requires different database structure, the commit message has "WARNING! SQL CHANGE!" and a description of the change (create new table "foo", run sp_autopopulate_foo, that sort of thing). I'd love to bring SQL under source control in the same way, but without interrupting our current workflow; making everyone go back to using written SQL rather than graphical tools for everything is a big interruption to workflow and so I'm loath to do it. Some kind of fix for this has been percolating around in my head for about five years, but I've never done it...
[gravatar]
Damien: Certainly a schema-less database reduces the need for this sort of update, but you may still need a framework for updating existing documents to new application requirements. In any case, the need for an updater will not be enough to push a relational application over to a non-relational database.

Antonio: I saw the Stepwise paper yesterday, and it is quite similar to the systems I have built. For a couple of reasons, I wouldn't require updates to be strictly sequential. That is, a database at schema 37 could update to schema 61 in a single step.

Stuart: I can't imagine working in an environment where developers have to manually apply SQL updates when pulling new code from source control. But a SQL diff tool won't get you very far. It might be a helpful starting point, but a diff of your Django models or whatever else is truly your source form of the schema would probably work as well. And there will always be a need for hand-crafting update clauses.
[gravatar]
What you describe is very similar to what the sqlobject-admin tool does. Basically it dumps a complete set of the "expected" CREATE statements into different directories (the database versioning is expected to be orthogonal to normal source control versioning), then presents the developer with a diff and asks them to create the appropriate update statement. Then the database itself holds a table with a row saying what version it is at.

The database is just at a single version, but you can also provide scripts that upgrade skipping intermediate versions. This way you can effectively cull a version if you mess up the upgrade but haven't applied it to some versions of the database.

If I was going to improve it, I might try to better guess what the SQL updates are, and present the programmer with that to edit. But I agree that an explicit confirmation and potentially editing of the SQL is always necessary.
[gravatar]
Hi, I'm a django developer. I do have a plan to do this before 1.0 if possible, probably after the magic-removal branch merges to trunk, but maybe during it.

The idea is that the old versions of the models will be checkpointed somewhere, with a simple copying scheme builtin, and plugins for stuff like subversion. The plan is that transitions will be created in an interactive process (via the webserver, probably), but can also be edited to add unusual bits. I plan to allow at the least field type transitions, combining, splitting, and hopefully things like count-field->entity transitions.
[gravatar]
This is one issue that most frameworks completely ignore. It's a hard problem to solve and not considered "sexy" work. Also when you have to deal with multiple relational databases the upgrade mechanisms vary. For example DB2 has *no* scripting support. And dropping a column with DB2 basically requires dropping and recreating a table. What a pain!
[gravatar]
...and SQLite doesn't support DROP COLUMN, ALTER COLUMN, or ADD CONSTRAINT.
[gravatar]
rjwittams: is there someplace where interested parties can kibbitz on your ideas before they are implemented?

Robert: I know all about SQLite's limitations. The system I built for Kubi Software included SQLite support, and I had to code around those limitations. I had a DropColumn() method that would do the crazy dance to recreate the table (with indexes), but without the column-to-be-dropped. It was a pain, but fascinating to get working right.
[gravatar]
It just so happens that Axiom implements almost exactly the solution you've proposed. It's subtly different in a few ways, but you should have a look at it! (Man, I really need to write some docs for upgraders...)

Your core insight, though, is totally correct. It's impossible to do this automatically. The best you can do is provide a nice hook to make it simple for a developer to write code that will be run at the appropriate time.
[gravatar]
I wrote an open source program trying to solve this very problem, you can find it at xml2ddl (it's a Python program, natch).
Briefly, my idea was to program the database schema in XML and to use diff like utility which shows what has changed between schema versions that you store in your version control. The reason it was done in XML was because DDL is too hard to parse, and every database has a different flavour. Unfortunately, I've had almost zero response and haven't worked on it in a while, but if I can scare up some more interest I may work on it some more.

Add a comment:

Ignore this:
Leave this empty:
Name is required. Either email or web are required. Email won't be displayed and I won't spam you. Your web site won't be indexed by search engines.
Don't put anything here:
Leave this empty:
Comment text is Markdown.