Own your data: ad-hoc representations

Created 27 May 2006

This may sound heretical in these days of standards for everything, but I’ve had the best successes by designing my own ad-hoc data formats. Rather than adopting (or worse, adapting) a standard to fit your purposes, you should create your own data representation. It will give you the best fit for the problem at hand.

How do I represent my data?

A common impulse when sitting down to sketch out a data representation is to cast about for a standard that does something kind of like what you need, and try to squeeze it to fit your needs. This impulse is driven by these beliefs:

- Standard is inherently better than non-standard,
- Using a standard will likely get you a bunch of free tools,
- The standard was written by smart people, and so is more “right” than something you’d cook up yourself.

Probably some of these are right, but there are competing pressures as well:

- The standard wasn’t designed to solve exactly your problem,
- The standard was written by a committee, and so is a compromise,
- The standard was likely heavily constrained by some existing technology in the first place,
- Standards are most often written to be good for interchange, not core representation.

The most important thing in choosing a data representation is whether it is expressive enough to meet your product’s needs. I find the best strategy for designing a representation goes something like this:

1. Consider all of the things the data must do, and try to include a best guess as to ways it will need to be extended in the future. Include both static representation issues (“we’ll need a way to include Kanji”) and dynamic processing issues (“we’ll need to be able to get a list of just headlines”).
2. Consider all of the different forms the data will have to be converted to. For example, your data may have to appear on screen, and be printed, and be aggregated into an XML stream for an existing system someplace.
3. The problem you are trying to solve is, “How do I best represent all of the concerns from #1, while making all of the conversions in #2 as simple as possible?”
4. Look at other systems and standards that cover some of your ground. One may be the best solution to your problem. If not, you can borrow concepts or syntax for the pieces of your problem where they do a good job.

The overriding concern during this design phase has to be representing your data as accurately and as deeply as possible. Ideally, you’d have no compromises due to trying to make another system fit your current needs. By putting your needs first, you’ll create a better representation than if you adapt someone else’s design.

You need to think about your data representation as a source form, with all of the formats you eventually need it in as object forms. Your code starts life in an expressive high-level language, and is then compiled down to an executable format. Your data needs to follow the same pattern: its core representation needs to be as high-level and expressive as possible, and needs to be convertible into object forms as needed.
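
To make the source-form and object-form idea concrete, here is a minimal sketch in Python. Everything in it is made up for illustration: the `Article` class, the three converter functions, and the `compile_to` helper are assumptions, not code from any of the systems described in this post. The point is only that the core record stores intent, and each output format is a small function over it.

```python
import json
from dataclasses import dataclass, field
from html import escape


@dataclass
class Article:
    """Hypothetical core ("source") form: it records content, not presentation."""
    title: str
    paragraphs: list[str] = field(default_factory=list)


def to_html(article: Article) -> str:
    """One "object" form: an HTML fragment."""
    parts = [f"<h1>{escape(article.title)}</h1>"]
    parts += [f"<p>{escape(p)}</p>" for p in article.paragraphs]
    return "\n".join(parts)


def to_plain_text(article: Article) -> str:
    """Another object form: plain text with an underlined title."""
    underline = "=" * len(article.title)
    return "\n".join([article.title, underline, ""] + article.paragraphs)


def to_json(article: Article) -> str:
    """A third object form: JSON, e.g. for feeding some other system."""
    return json.dumps({"title": article.title, "body": article.paragraphs})


# The "compiler back ends": adding an output format means adding one entry here.
TARGETS = {"html": to_html, "text": to_plain_text, "json": to_json}


def compile_to(article: Article, target: str) -> str:
    """Convert the source form into the named object form."""
    return TARGETS[target](article)


article = Article("Own your data", ["Design your own format."])
print(compile_to(article, "html"))
print(compile_to(article, "text"))
print(compile_to(article, "json"))
```

Nothing about the core record changes when a new target is needed; only a new converter is written.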

Case in point: this blog

This blog is home-grown, and the end result is not a spectacular success, but it works. When I first started, my blog posts were destined for only one place: an HTML page on this blog. But I didn’t represent them as HTML, because that would have limited my flexibility. They are stored as XML files, using an ad-hoc tag language. For the parts of posts that overlap with HTML, the tag set is HTML. But for other parts, I simply make up tags that express my intent.

Here is a sample blog entry:

<?xml version='1.0' encoding='utf-8'?>
<blog>
<entry when='20050726T083044'>
<title>IPod Flea</title>
<category>funny</category>
<category>music</category>
<via href="http://boingboing.net">Boing Boing</via>
<body>
<p>Awesome:
<a href='http://www.layersmagazine.com/features/feature_cs2/flea.htm'>iPod Flea</a>.
</p>
</body>
</entry>
</blog>

The body consists of simple HTML tags, but the metadata of the post uses tags I made up as I needed them. By using HTML for the body, I simplified the process of writing the posts, and the process of converting them to HTML. But even within the HTML, I have special tags for structures I want that HTML doesn’t have.
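
A format like this takes very little code to process. The following is a rough sketch in Python using the standard library's `xml.etree.ElementTree`, not the code that actually runs this blog: the function names `headlines` and `entry_to_html`, and the HTML structure they emit, are assumptions made for illustration. It shows both kinds of use: a dynamic query (a list of just the headlines) and a conversion to HTML.

```python
import xml.etree.ElementTree as ET
from datetime import datetime
from html import escape

# The sample entry from above, unchanged.
SAMPLE = """<?xml version='1.0' encoding='utf-8'?>
<blog>
<entry when='20050726T083044'>
<title>IPod Flea</title>
<category>funny</category>
<category>music</category>
<via href="http://boingboing.net">Boing Boing</via>
<body>
<p>Awesome:
<a href='http://www.layersmagazine.com/features/feature_cs2/flea.htm'>iPod Flea</a>.
</p>
</body>
</entry>
</blog>
"""


def headlines(blog):
    """Return just the titles of all entries."""
    return [entry.findtext("title") for entry in blog.iter("entry")]


def entry_to_html(entry):
    """Render one entry as an HTML fragment (the layout here is hypothetical)."""
    when = datetime.strptime(entry.get("when"), "%Y%m%dT%H%M%S")
    parts = [
        f"<h2>{escape(entry.findtext('title'))}</h2>",
        f"<p class='date'>{when.date().isoformat()}</p>",
    ]
    tags = ", ".join(escape(c.text) for c in entry.findall("category"))
    if tags:
        parts.append(f"<p class='tags'>{tags}</p>")
    via = entry.find("via")
    if via is not None:
        href = escape(via.get("href"))
        parts.append(f"<p class='via'>via <a href='{href}'>{escape(via.text)}</a></p>")
    # The body is already HTML-like, so its children are copied through as-is.
    body = entry.find("body")
    parts += [ET.tostring(child, encoding="unicode").strip() for child in body]
    return "\n".join(parts)


blog = ET.fromstring(SAMPLE.encode("utf-8"))
print(headlines(blog))
print(entry_to_html(blog.find("entry")))
```

Because the metadata lives in tags that say what they mean, the headline query is one line, and the made-up `via` tag can be rendered however the page design calls for.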

Case in point: Kubi schema representation

At Kubi Software, we had a code-independent representation of our data schema. This schema was used to generate code in a number of languages, relational database schemas, data validation tables, and so on. One proposal for how to represent this schema was to use XML Schema. This would have made it simple for us to expose our schema in a standard way for users of our API.

But XML Schema was designed to solve a different problem than we had. It is used for describing classes of XML documents. We needed a solution that was far broader than that. For example, we needed to describe foreign key relationships for our database structures. XML Schema has no such concept. If we had used XML Schema, we’d have had to extend it to include our foreign keys. True, XML Schema has made these sorts of extensions possible, but once you have to extend a standard in a proprietary way, you’ve lost some of the benefits you thought you had: the tools you got for free won’t understand your extensions, and you’re back to supporting your own ad-hoc representation.

At Kubi, we created a custom XML dialect to describe the schema. It was ad-hoc, and we had some hiccups extending it at times, and the semantics had a certain sloppiness to them, but it served its purpose. We converted it to XML Schema for API customers, and had straightforward scripts to process it into all the other forms we needed.

We probably could have made do with XML Schema as a basis, but by the time we had retrofitted it with all of the extensions we needed, we’d have essentially had an ad-hoc representation anyway, except we’d have had to fight with the XML Schema syntax along the way for no good reason.
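
To give a feel for the approach, here is a small sketch in Python. It is emphatically not the Kubi dialect, which isn't shown here: the tag and attribute names (`table`, `field`, `key`, `references`), the type names, and the functions `to_sql` and `to_validation_table` are all invented for this example. It shows a custom XML description in which a foreign key is a first-class concept, and two simple scripts that turn it into other forms.

```python
import sqlite3
import xml.etree.ElementTree as ET

# A hypothetical ad-hoc schema dialect; "references" expresses a foreign key directly.
SCHEMA = """
<schema>
  <table name="project">
    <field name="id" type="int" key="primary"/>
    <field name="name" type="string"/>
  </table>
  <table name="task">
    <field name="id" type="int" key="primary"/>
    <field name="title" type="string"/>
    <field name="project_id" type="int" references="project.id"/>
  </table>
</schema>
"""

SQL_TYPES = {"int": "INTEGER", "string": "TEXT"}


def to_sql(schema):
    """Generate CREATE TABLE statements, including foreign key constraints."""
    statements = []
    for table in schema.findall("table"):
        lines = []
        for fld in table.findall("field"):
            column = f"{fld.get('name')} {SQL_TYPES[fld.get('type')]}"
            if fld.get("key") == "primary":
                column += " PRIMARY KEY"
            lines.append(column)
        # Constraints go after all column definitions.
        for fld in table.findall("field"):
            ref = fld.get("references")
            if ref:
                ref_table, ref_column = ref.split(".")
                lines.append(
                    f"FOREIGN KEY ({fld.get('name')}) REFERENCES {ref_table}({ref_column})"
                )
        body = ",\n  ".join(lines)
        statements.append(f"CREATE TABLE {table.get('name')} (\n  {body}\n);")
    return "\n\n".join(statements)


def to_validation_table(schema):
    """Generate a lookup of (table, field) -> declared type for data validation."""
    return {
        (table.get("name"), fld.get("name")): fld.get("type")
        for table in schema.findall("table")
        for fld in table.findall("field")
    }


schema = ET.fromstring(SCHEMA)
ddl = to_sql(schema)
print(ddl)

# Sanity check: SQLite accepts the generated DDL.
sqlite3.connect(":memory:").executescript(ddl)
```

Each additional output (code in another language, an XML Schema export for API customers) would be one more function of about this size over the same description.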

Case in point: Tabblo templates

When we started building Tabblo, we realized that we would need to produce many forms of output from a single tabblo. It was tempting to focus on the web pages that were the first demoable part of the application, and simply use HTML and CSS as a representation, but that would have made the rest of the application difficult to build.

HTML is not a good way to produce high-quality output; PDF is. And producing PDF from HTML is not a simple process. So we went with a custom representation of templates and tabblos that allowed us the greatest flexibility. We looked at what we liked in HTML and CSS, what would work well for our needs, and what did not. We thought about how we’d generate HTML from our internals, and how to do it efficiently. We designed a representation that let us talk about tabblos, layouts, and themes as we thought of them, not as HTML does.

For example, some of our themes include drop shadows beneath the photos. In HTML, the drop shadow is actually a second image beneath the photo image. But that’s not a good core representation of a drop shadow. Better is a simple flag indicating that photos should have drop shadows. How the drop shadow is actually rendered in HTML is a run-time concern, not a design-time concern. In fact, over the last few months, we’ve changed the actual HTML rendering of drop shadows a few times, but the core description of them has not changed.
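
Here is a sketch of that separation in Python. It is not Tabblo's code: the `Theme` and `Photo` classes, the two renderer functions, the `/static/shadow.png` path, and the CSS values are all hypothetical. The core description carries only a `drop_shadow` flag, and two interchangeable HTML renderers decide how to draw it.

```python
from dataclasses import dataclass
from html import escape


@dataclass(frozen=True)
class Theme:
    """Core description: says *that* photos have shadows, not *how* to draw them."""
    name: str
    drop_shadow: bool = False


@dataclass(frozen=True)
class Photo:
    url: str
    width: int
    height: int


def render_shadow_as_image(photo: Photo, theme: Theme) -> str:
    """One HTML rendering: a second (shadow) image placed beneath the photo."""
    img = f'<img src="{escape(photo.url)}" width="{photo.width}" height="{photo.height}">'
    if not theme.drop_shadow:
        return img
    shadow = (
        f'<img class="shadow" src="/static/shadow.png" '
        f'width="{photo.width}" height="{photo.height}">'
    )
    return f'<div class="photo">{shadow}{img}</div>'


def render_shadow_as_css(photo: Photo, theme: Theme) -> str:
    """A different HTML rendering of the same flag, using a style rule instead."""
    style = ' style="box-shadow: 3px 3px 6px #888"' if theme.drop_shadow else ""
    return (
        f'<img src="{escape(photo.url)}" width="{photo.width}" '
        f'height="{photo.height}"{style}>'
    )


photo = Photo("/photos/1.jpg", 400, 300)
shadowed = Theme("scrapbook", drop_shadow=True)
plain = Theme("plain")

print(render_shadow_as_image(photo, shadowed))
print(render_shadow_as_css(photo, shadowed))
print(render_shadow_as_image(photo, plain))
```

Swapping one renderer for the other changes the HTML but leaves every stored theme untouched, and a non-HTML renderer could read the same flag and draw the shadow its own way.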

Since tabblos are not tied to HTML, we have more options for manipulating them. For example, consider the thumbnail tabblos used on navigation pages: we used to render them with a browser-on-a-leash. We’d squirt the tabblo HTML to the browser, then capture a bitmap from it, and scale it down. This was an awkward process at best, and involved some ugly scripting, and a lot of fiddly process manipulation.

With tabblos in an ad-hoc representation, we can produce PDFs for printing, as well as all sorts of thumbnail images directly with a completely non-HTML rendering process:

Tabblo: Charlotte’s Web with the 2nd Grade

Not only have we simplified our processing pipelines by starting with a simpler data format than fully general HTML and CSS, but we have the power of going beyond what HTML+CSS can describe. For example, a drop shadow cannot be described as a drop shadow in those standards, and in the future, we’ll have far more interesting things we’ll want to describe. By using our own representation, we can take it in any direction we want to go.

For example, just yesterday we added user color control to the tabblo editor. Our tabblo representation was designed with this sort of tweaking in mind, so it gave us a really cool feature without a ton of reworking the insides.

Works for me

Some might accuse me of suffering from Not Invented Here syndrome, but I don’t agree. Choosing a data representation has to be considered carefully, and appropriateness has to count for more than similarity to someone else’s solution.

Joel Spolsky (with whom I do not always agree) sums it up this way:

If it’s a core business function — do it yourself, no matter what.

If the representation of your data is not a core business function, I don’t know what is. Design your own format, own your data. You’ll be glad you did.

Comments

Peter Bengtsson 7:43 AM on 28 May 2006

I very much agree with you and Joel. I too have had much success with inventing "my own standard" but only with core functionality stuff. Sometimes it's not easy to see what is your "core business" because things might grow and change. Had I known that our intranet blogs were going to be so important I wouldn't have tried to fit our needs into shitty existing products; I would have built it myself. However, there is stuff that you care less about which slowly approaches a solution that we could have downloaded and adapted a long time ago. A.k.a. reinventing the wheel.

Long live ad-hoc solutions!

Manuzhai 4:12 PM on 28 May 2006

Well, sure, sometimes, but using standards is pretty cool as well. For example, I converted my weblog to mostly just use Atom, since it really does mostly everything you need. Similarly, I often create some pretty dense, clean XML representation in my backend, and just use XSLT for templating. That also easily gets me PDF (via XSL-FO). Standards are pretty cool.

Sami Shalabi 9:56 AM on 29 May 2006

Totally agree Ned! Thanks for writing your thoughts up.

Comment posted here:
http://samishalabi.blogspot.com/2006/05/my-data-format-is-mine.html

kevin 6:29 AM on 31 May 2006

I'll echo a point made above - Leveraging a standard enables you to leverage other components and infrastructure developed to the standard.

In most cases the core business values of a product are above the plumbing, and where you are able to leverage readily available infrastructure, you are able to invest more engineering resources in the development of business-specific product functionality. Plumbing needs to be maintained, and if a significant portion of your product is ad-hoc infrastructure, the maintenance burden can also be significant.

Standards-based infrastructure may not be ideally suited to a specific application; however, many offerings are quite capable and have been designed to accommodate adaptation or extension without necessarily losing the values of being standards-based.