Metadata is nothing new

Created 26 September 2003

Web technologists talk about “metadata” as a powerful enabling technology. It is; there’s no arguing with that. But they miss plenty of opportunities to point out that although “metadata” is a new word, it’s an old concept. In fact, its age is proof that metadata is an important idea. It’s been helping information technologists for centuries.

What is metadata?

Here’s the usual definition of “metadata” these days:

Metadata is data about data.

This is a good definition when talking about electronic data, and is how we get to use the cool “meta-” prefix. A broader and more useful definition is:

Metadata is information about a thing, apart from the thing itself.

This definition lets us talk about metadata in the real world.

Metadata in the real world

Let’s consider a book. The core of the book is the words themselves. For example, Moby Dick begins, “Call me Ishmael”. But aside from the actual text of the book, there’s also information about the book. Its title is “Moby Dick”, its author is Herman Melville, its date of publication is 1851, and so on. This is information about the book that isn’t found in the text of the book. This is metadata.

In the real world, metadata is often available on the outside of things. Books typically have information printed on the cover: title, author, publisher, shelving category, ISBN. Books also have metadata inside: the copyright page, the title page, the headings on the pages themselves, even the page numbers. All of this is information about the book, rather than the text of the book itself. When libraries buy a book, they add even more metadata: card catalog numbers, copy numbers, date acquired.

Of course, other things besides books have metadata. Music CDs have artist, title, track listing, and parental advisories. Clothing has size, washing instructions, and manufacturer. Food packages have weight and nutritional information. New cars have stickers detailing their gas mileage.

In addition to the metadata available with the item itself, there’s also metadata published elsewhere. In the library, there’s the card catalog. In bookstores, there’s Books in Print, a large book filled with nothing but metadata about other books. And publishers provide information about their publications in catalogs, promotional materials, etc.

All of this is metadata: it’s information about a thing, rather than the thing itself.

Metadata has been around for a long time. Millennia ago, when libraries consisted of piles of rolled-up scrolls, little tags were attached so you could tell which scroll was which without unrolling them. Each tag was metadata.

What is metadata good for?

Think about how a library works. The books are organized by subject, by author, or sometimes in special collections. How is it done? By using metadata to find the right place for the book. Imagine a library where all the books were unmarked manuscripts, with no metadata. Hand a clerk a book to shelve, and he’d sit down and start reading the book. Who knows how many pages he’d have to read before he’d properly categorized the book? It would take a long time, and would be subject to his interpretation of the book. He’d have to be much cleverer than the guy who could just read “Non-fiction/Bird watching” off the back of a book with metadata. And as we mentioned above, some of the interesting information, such as the author’s name, isn’t in the text at all; it exists only as metadata.

Metadata is used in the real world just as it is in the virtual world: to organize information. Of course, the book is more important than the metadata: if you had to choose between having the book without the cover, or the cover without the book, you’d choose the book. You could still read the book, and in fact, if you have read the book, or intend to read the book, the cover doesn’t provide much extra value. But if you need to deal with lots of books, for example, in a bookstore or a library, the cover is invaluable. As an extreme example, in many video stores, the store itself consists of nothing but covers (to prevent theft of the video), and the store operates perfectly well: customers browse, make selections, and rent videos.

To sum up: metadata is used to organize, manipulate, shelve, locate, categorize, and otherwise work with data when you don’t want to actually deal with the data itself.
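
The same idea can be sketched in a few lines of Python. This is only an illustration: the Book record, its field names, and the bird-watching title are made up for the example. The point is that shelving looks at the small metadata fields and never opens the text.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Book:
    # Metadata: small, structured, cheap to inspect.
    title: str
    author: str
    category: str
    # Data: the full text, which shelving never needs to read.
    text: str


def shelve(books):
    """Group books by category, then order each shelf by author and title."""
    shelves = defaultdict(list)
    for book in books:
        shelves[book.category].append(book)
    for shelf in shelves.values():
        shelf.sort(key=lambda b: (b.author, b.title))
    return dict(shelves)


# Sample records; the texts are placeholders, and the bird book is hypothetical.
books = [
    Book("Moby Dick", "Herman Melville", "Fiction", "Call me Ishmael. ..."),
    Book("A Made-Up Guide to Birds", "A. Example", "Non-fiction/Bird watching", "..."),
    Book("Bartleby, the Scrivener", "Herman Melville", "Fiction", "..."),
]

for category, shelf in sorted(shelve(books).items()):
    print(category, [book.title for book in shelf])
```

The clerk in this version never reads a page: the category, author, and title are enough to put every book in its place.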

Metadata on the web

The purpose and use of metadata are the same on the web as in the real world. Metadata provides information about data so that the data can be worked with in bulk, without having to read all the source data. When I publish my blog, there is the main information (the posts themselves), there is metadata visible on the page (the title of the entry, the date of the entry), there is metadata invisible on the page (I include geeky things like my geographical coordinates), and there is metadata published on separate pages (my archive page, and my RSS feed).
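
To make the RSS case concrete, here is a minimal sketch of software listing a blog's entries from the feed alone, without fetching a single post. The feed below is invented for the example and assumes the common RSS 2.0 layout of item elements holding title, link, and pubDate; a real feed may differ.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

# A made-up feed, assuming the usual RSS 2.0 element names.
FEED = """<rss version="2.0">
  <channel>
    <title>Example blog</title>
    <item>
      <title>Metadata is nothing new</title>
      <link>http://example.com/blog/metadata.html</link>
      <pubDate>Fri, 26 Sep 2003 12:00:00 GMT</pubDate>
    </item>
    <item>
      <title>An earlier entry</title>
      <link>http://example.com/blog/earlier.html</link>
      <pubDate>Mon, 22 Sep 2003 09:30:00 GMT</pubDate>
    </item>
  </channel>
</rss>"""


def entries(feed_xml):
    """Return (title, link, published) for each item, read from the feed only."""
    root = ET.fromstring(feed_xml)
    return [
        (
            item.findtext("title"),
            item.findtext("link"),
            parsedate_to_datetime(item.findtext("pubDate")),
        )
        for item in root.iter("item")
    ]


for title, link, published in entries(FEED):
    print(published.date(), title, link)
```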

Just like the library example from the real world, all this metadata is used to help people find information, organize information, summarize information, and so on. Except that in the online world, this information is used by other software to do all those things automatically. For example, because I have geographical coordinates in my metadata, geourl.org can automatically tell me about other bloggers near me.
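
Here is a rough sketch of how a service along those lines might work. It is not geourl.org's actual code. It assumes each page carries its coordinates in a meta tag named ICBM (one convention for this; the tag name, the sample pages, and the 50 km radius are all assumptions for the example), and it compares locations with the standard haversine formula.

```python
import math
from html.parser import HTMLParser


class GeoMetaParser(HTMLParser):
    """Pick up coordinates from <meta name="ICBM" content="lat, lon">."""

    def __init__(self):
        super().__init__()
        self.coords = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "icbm" and attr.get("content"):
            lat, lon = (float(part) for part in attr["content"].split(","))
            self.coords = (lat, lon)


def coords_of(html):
    """Return (lat, lon) from the page's metadata, or None if it has none."""
    parser = GeoMetaParser()
    parser.feed(html)
    return parser.coords


def distance_km(a, b):
    """Great-circle distance between two (lat, lon) points (haversine formula)."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def neighbors(my_html, other_pages, radius_km):
    """Names of the pages whose coordinates lie within radius_km of mine."""
    here = coords_of(my_html)
    found = []
    for name, html in other_pages.items():
        there = coords_of(html)
        if there is not None and distance_km(here, there) <= radius_km:
            found.append(name)
    return sorted(found)


# Hypothetical pages: only the metadata matters, the posts are never read.
MY_PAGE = '<html><head><meta name="ICBM" content="42.36, -71.06"></head></html>'
OTHER_PAGES = {
    "near.example": '<head><meta name="ICBM" content="42.37, -71.11"></head>',
    "far.example": '<head><meta name="ICBM" content="51.50, -0.12"></head>',
    "untagged.example": "<head><title>No coordinates here</title></head>",
}

print(neighbors(MY_PAGE, OTHER_PAGES, 50))  # ['near.example']
```

A page without the tag is simply skipped, which is the flip side of the library example: no metadata, no automatic shelving.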

In electronic form, it can be hard to tell where the data ends and the metadata begins. Technorati is a popular blog link-tracking tool. It finds blogs by using metadata published by blog trackers such as blo.gs. Then it reads the blog entries themselves to find the links. These links are part of the actual data of the blog, but are accessible to software, just as the metadata is.
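
A small sketch shows why the line blurs. The code below is an illustration, not how Technorati works internally, and the entry and URLs are invented. It pulls the links out of an entry's HTML, and those links are then just as easy for software to work with as any metadata field.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the target of every <a href="..."> in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page's own address.
                self.links.append(urljoin(self.base_url, href))


def links_in(html, base_url):
    """Return the absolute URLs linked from the given HTML, in page order."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


# A made-up blog entry hosted at a made-up address.
ENTRY = (
    '<p>I found this through <a href="http://blo.gs/">blo.gs</a>, '
    'and wrote about it <a href="/archive.html">earlier</a>.</p>'
)

print(links_in(ENTRY, "http://example.com/blog/entry.html"))
```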

Metadata: now and forever

Metadata serves an incredibly valuable purpose: letting us step back from our information and talk about it rather than just use it.

Electronic data on the web has made metadata even more powerful: both the original data and the metadata about it are published in similar ways. We can collect metadata together, and publish metadata about it, creating ever more powerful aggregations of data, building amazing new applications that draw upon ever broader ecosystems of data.

But it’s nothing new.