Ned Batchelder
Own your data: ad-hoc representations
Created 27 May 2006
This may sound heretical in these days of standards for everything, but I've had the best successes by designing my own ad-hoc data formats. Rather than adopting (or worse, adapting) a standard to fit your purposes, you should create your own data representation. It will give you the best fit for the problem at hand.
A common impulse when sitting down to sketch out a data representation is to cast about for a standard that does something kind of like what you need, and try to squeeze it to fit your needs. This impulse is driven by these beliefs:
Probably some of these are right, but there are competing pressures as well:
The most important thing in choosing a data representation is whether it is expressive enough to meet your product's needs. I find the best strategy for designing a representation goes something like this:
The overriding concern during this design phase has to be representing your data as accurately and as deeply as possible. Ideally, you'd have no compromises due to trying to make another system fit your current needs. By putting your needs first, you'll create a better representation than if you adapt someone else's design.
You need to think about your data representation as a source form, with all of the formats you eventually need it in as object forms. Your code starts life in an expressive high-level language, and is then compiled down to an executable format. Your data needs to follow the same pattern: its core representation needs to be as high-level and expressive as possible, and needs to be convertible into object forms as needed.
This blog is home-grown, and the end result is not a spectacular success, but it works. When I first started, my blog posts were destined for only one place: an HTML page on this blog. But I didn't represent them as HTML, because that would have limited my flexibility. They are stored as XML files, using an ad-hoc tag language. For the parts of posts that overlap with HTML, the tag set is HTML. But for other parts, I simply make up tags that express my intent.
Here is a sample blog entry:
The body consists of simple HTML tags, but the metadata of the post uses tags I made up as I needed them. By using HTML for the body, I simplified the process of writing the posts, and the process of converting them to HTML. But even within the HTML, I have special tags for structures I want that HTML doesn't have.
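As a rough illustration of the idea (not the actual format or code behind this blog), here is a minimal Python sketch. The tag names `entry`, `title`, `category`, `body`, and `quote`, and the functions `render_body_element` and `entry_to_html`, are all invented for the example. Metadata tags are read as data, body tags that overlap with HTML pass through unchanged, and the one made-up body tag is expanded into ordinary HTML.

```python
import xml.etree.ElementTree as ET
from html import escape

# A hypothetical entry: made-up metadata tags, an HTML-like body, and one
# invented body tag (<quote>) that HTML does not have.
ENTRY = """
<entry when="20060527">
  <title>Own your data</title>
  <category>design</category>
  <body>
    <p>Design your <b>own</b> format.</p>
    <quote by="Joel Spolsky">Do it yourself.</quote>
  </body>
</entry>
"""

def render_body_element(elem):
    """Return HTML for one body element, expanding invented tags."""
    if elem.tag == "quote":
        text = escape(elem.text or "")
        by = escape(elem.get("by", ""))
        return f"<blockquote><p>{text}</p><cite>{by}</cite></blockquote>"
    # Tags that overlap with HTML are serialized as they are.
    return ET.tostring(elem, encoding="unicode", method="html").strip()

def entry_to_html(xml_text):
    """Convert one entry from the ad-hoc source form to an HTML fragment."""
    entry = ET.fromstring(xml_text.strip())
    title = escape(entry.findtext("title", default=""))
    body = "\n".join(render_body_element(e) for e in entry.find("body"))
    return f"<h1>{title}</h1>\n{body}"

print(entry_to_html(ENTRY))
```

Because the metadata lives in its own tags rather than in presentation markup, a second function could read the same entry and emit a different object form, such as a feed item, without touching the source file.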
At Kubi Software, we had a code-independent representation of our data schema. This schema was used to generate code in a number of languages, relational database schemas, data validation tables, and so on. One proposal for how to represent this schema was to use XML Schema. This would have made it simple for us to expose our schema in a standard way for users of our API.
But XML Schema was designed to solve a different problem than we had. It is used for describing classes of XML documents. We needed a solution that was far broader than that. For example, we needed to describe foreign key relationships for our database structures. XML Schema has no such concept. If we had used XML Schema, we'd have to extend it to include our foreign keys. True, XML Schema has made these sorts of extensions possible, but once you have to extend a standard in a proprietary way, you've lost some of the benefits you thought you had: the tools you got for free won't understand your extensions, and you're back to supporting your own ad-hoc representation.
At Kubi, we created a custom XML dialect to describe the schema. It was ad-hoc, we had some hiccups extending it at times, and the semantics had a certain sloppiness to them, but it served its purpose. We converted it to XML Schema for API customers, and had straightforward scripts to process it into all the other forms we needed.
We probably could have made do with XML Schema as a basis, but by the time we had retrofitted it with all of the extensions we needed, we'd have essentially had an ad-hoc representation anyway, except we'd have had to fight with the XML Schema syntax along the way for no good reason.
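To make the foreign key point concrete, here is a small Python sketch of what such a pipeline might look like. This is not Kubi's actual dialect or tooling: the element and attribute names (`schema`, `table`, `field`, `key`, `references`), the type mapping, and the functions `field_to_sql` and `schema_to_sql` are assumptions made up for illustration. The point is only that a custom dialect can state a foreign key directly, and a short script can turn it into one of the needed object forms.

```python
import xml.etree.ElementTree as ET

# Hypothetical schema dialect; every element and attribute name is invented.
SCHEMA = """
<schema>
  <table name="account">
    <field name="id" type="int" key="primary"/>
    <field name="email" type="string"/>
  </table>
  <table name="document">
    <field name="id" type="int" key="primary"/>
    <field name="owner" type="int" references="account.id"/>
  </table>
</schema>
"""

# Assumed mapping from the dialect's types to SQL column types.
SQL_TYPES = {"int": "INTEGER", "string": "VARCHAR(255)"}

def field_to_sql(field):
    """Return the SQL column definition for one <field> element."""
    parts = [field.get("name"), SQL_TYPES[field.get("type")]]
    if field.get("key") == "primary":
        parts.append("PRIMARY KEY")
    ref = field.get("references")
    if ref:
        # The foreign key is first-class in the source form.
        table, column = ref.split(".")
        parts.append(f"REFERENCES {table}({column})")
    return " ".join(parts)

def schema_to_sql(xml_text):
    """Return a list of CREATE TABLE statements, one per <table>."""
    root = ET.fromstring(xml_text.strip())
    statements = []
    for table in root.findall("table"):
        columns = ",\n  ".join(field_to_sql(f) for f in table.findall("field"))
        statements.append(f"CREATE TABLE {table.get('name')} (\n  {columns}\n);")
    return statements

for statement in schema_to_sql(SCHEMA):
    print(statement)
```

A sibling function could walk the same tree to emit XML Schema for API customers, or validation tables, each one ignoring the attributes it has no use for.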
When we started building Tabblo, we realized that we would need to produce many forms of output from a single tabblo. It was tempting to focus on the web pages that were the first demoable part of the application, and simply use HTML and CSS as a representation, but that would have made the rest of the application difficult to build.
HTML is not a good way to produce high-quality output; PDF is. And producing PDF from HTML is not a simple process. So we went with a custom representation of templates and tabblos that allowed us the greatest flexibility. We looked at what we liked in HTML and CSS, what would work well for our needs, and what would not. We thought about how we'd generate HTML from our internals, and how to do it efficiently. We designed a representation that let us talk about tabblos, layouts, and themes as we thought of them, not as HTML does.
For example, some of our themes include drop shadows beneath the photos. In HTML, the drop shadow is actually a second image beneath the photo image. But that's not a good core representation of a drop shadow. Better is a simple flag indicating that photos should have drop shadows. How the drop shadow is actually rendered in HTML is a run-time concern, not a design-time concern. In fact, over the last few months, we've changed the actual HTML rendering of drop shadows a few times, but the core description of them has not changed.
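A sketch of that separation, in Python, might look like the following. It is not Tabblo's code: the `Theme` class, its `drop_shadows` field, the two renderer functions, and the file name `shadow.png` are hypothetical names chosen for the example. The core description holds only the intent, and each renderer decides how to express it.

```python
from dataclasses import dataclass
from html import escape

@dataclass
class Theme:
    """Hypothetical theme description: it records intent, not markup."""
    name: str
    drop_shadows: bool = False

def photo_html_two_images(theme, src):
    """One possible rendering: a separate shadow image beneath the photo."""
    photo = f'<img class="photo" src="{escape(src, quote=True)}">'
    if theme.drop_shadows:
        return f'<div class="framed"><img class="shadow" src="shadow.png">{photo}</div>'
    return photo

def photo_html_css(theme, src):
    """Another possible rendering: a CSS class carries the shadow."""
    cls = "photo shadowed" if theme.drop_shadows else "photo"
    return f'<img class="{cls}" src="{escape(src, quote=True)}">'

theme = Theme("classic", drop_shadows=True)
print(photo_html_two_images(theme, "beach.jpg"))
print(photo_html_css(theme, "beach.jpg"))
```

Swapping one renderer for the other changes the generated HTML but leaves every stored `Theme` untouched, which is the property described above.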
Since tabblos are not tied to HTML, we have more options for manipulating them. For example, consider the thumbnail tabblos used on navigation pages: we used to render them with a browser-on-a-leash. We'd squirt the tabblo HTML to the browser, then capture a bitmap from it, and scale it down. This was an awkward process at best, involving some ugly scripting and a lot of fiddly process manipulation.
With tabblos in an ad-hoc representation, we can produce PDFs for printing, as well as all sorts of thumbnail images directly with a completely non-HTML rendering process:
Not only have we simplified our processing pipelines by starting with a simpler data format than fully general HTML and CSS, but we also have the power to go beyond what HTML+CSS can describe. For example, a drop shadow cannot be described as a drop shadow in those standards, and in the future, we'll have far more interesting things we'll want to describe. By using our own representation, we can take it in any direction we want to go.
For example, just yesterday we added user color control to the tabblo editor. Our tabblo representation was designed with this sort of tweaking in mind, so it gave us a really cool feature without a ton of rework on the insides.
Some might accuse me of suffering from Not Invented Here syndrome, but I don't agree. Choosing a data representation has to be considered carefully, and appropriateness has to count for more than similarity to someone else's solution.
Joel Spolsky (with whom I do not always agree) sums it up this way:
If it's a core business function — do it yourself, no matter what.
If the representation of your data is not a core business function, I don't know what is. Design your own format, own your data. You'll be glad you did.