Sunday 18 December 2005 — This is 19 years old. Be careful.
One of the trickier aspects of building applications on relational databases is upgrading the database schema. As features get added to the application, columns and tables get added to the database. At some point, you want to install the new code onto an existing installation. How do you change the database schema to add the columns and tables without losing the existing data?
In my last two startups, I was put in charge of figuring this out. The two designs I came up with had some differences, because there were different requirements and constraints, but they had these aspects in common:
- Revisions of the database schema were assigned serial numbers by the developers who made changes.
- The developer had to manually write code (an update clause) to change the database schema, though there were helpers to isolate him from as many messy details as possible.
- The database recorded which schema number it currently had.
- When started, the application would compare the current schema number in the database with the current schema number in the code. If they differed, it would find and run the proper update clauses in sequence to bring the database up to date.
We’ll have this same problem in my next job, though this time it will be on top of the Django framework. The Django dudes understand the issues (the first three comments on the Model Reference documentation page are about automatically upgrading databases), but may not tackle it for 1.0.
There have been at least a few attempts at solutions, but I think they are heading down the “too automatic” path.
What I found doing this in the past in real applications was that database upgrades are too complicated to be done automatically. The simple cases (adding a nullable column, dropping a column) can be made very simple. For example, the developer writes an upgrade clause that includes a call to a DropColumn function. But the complicated cases are always going to be beyond the reach of automatic generation. Real upgrades include things like taking an existing text column and splitting it into rows in a new foreign-keyed table, or adding a non-nullable column with existing values being computed from other tables. Upgrades also include stuff like realizing your application had been writing invalid data for a while, and running update statements to fix the bad rows.
The real world is messy. There’s just no way to leave the developer out of the solution.
Comments
I think a good solution might be to forget relational dbs altogether and go with a schema-less Db system. Of course, since I'm writing my own schema-less db system (CouchDb), perhaps I'm slightly biased.
Antonio: I saw the Stepwise paper yesterday, and it is quite similar to the systems I have built. For a couple of reasons, I wouldn't require updates to be strictly sequential. That is, a database at schema 37 could update to schema 61 in a single step.
Stuart: I can't imagine working in an environment where developers have to manually apply SQL updates when pulling new code from source control. But a SQL diff tool won't get you very far. It might be a helpful starting point, but a diff of your Django models or whatever else is truly your source form of the schema would probably work as well. And there will always be a need for hand-crafting update clauses.
The database is just at a single version, but you can also provide scripts that upgrade skipping intermediate versions. This way you can effectively cull a version if you mess up the upgrade but haven't applied it to some versions of the database.
If I was going to improve it, I might try to better guess what the SQL updates are, and present the programmer with that to edit. But I agree that an explicit confirmation and potentially editing of the SQL is always necessary.
The idea is that the old versions of the models will be checkpointed somewhere, with a simple copying scheme builtin, and plugins for stuff like subversion. The plan is that transitions will be created in an interactive process (via the webserver, probably), but can also be edited to add unusual bits. I plan to allow at the least field type transitions, combining, splitting, and hopefully things like count-field->entity transitions.
Robert: I know all about SQLite's limitations. The system I built for Kubi Software included SQLite support, and I had to code around those limitations. I had a DropColumn() method that would do the crazy dance to recreate the table (with indexes), but without the column-to-be-dropped. It was a pain, but fascinating to get working right.
Your core insight, though, is totally correct. It's impossible to do this automatically. The best you can do is provide a nice hook to make it simple for a developer to write code that will be run at the appropriate time.
Briefly, my idea was to program the database schema in XML and to use diff like utility which shows what has changed between schema versions that you store in your version control. The reason it was done in XML was because DDL is too hard to parse, and every database has a different flavour. Unfortunately, I've had almost zero response and haven't worked on it in a while, but if I can scare up some more interest I may work on it some more.
Add a comment: