Saturday 14 October 2006 — This is 18 years old. Be careful.
From a roundup of the new full-text search feature in SQLite, I found a reference to the Porter Stemmer, an algorithm for reducing an English word to its root (sort of):
cat --> cat
catatonic --> cataton
catatonically --> cataton
catch --> catch
catched --> catch
catches --> catch
catching --> catch
caught --> caught
abstemiously --> abstemi
manliness --> manli
mainlines --> mainlin
I say “sort of” because abstemi is not exactly the root of abstemiously, but is close enough. The point of the algorithm is to map words from the same root onto the same value. In this way, it’s like a hash algorithm. The fact that the value it comes up with is often the root, or usually very similar to the root, is a nice side effect, but if it mapped all of the “catch-” words onto 117243, it would still work as the basis for a full-text search engine.
And of course, English is a bastard when it comes to this sort of thing, as you can see from the “caught” case.
BTW: Martin Porter (the author of the algorithm) has a marvelously (root: marvel) dry witty home page which starts off with a startlingly accurate prediction (roots: startlingli accur predict).
Also BTW: abstemiously is a word with all six vowels, in alphabetical order.
Comments
This probably has something to do with the learned behaviors that we have from working with indexes: when I'm looking for something on Google that I've seen before, I look for a peculiar turn of a phrase of words that appear in order. For example, if I wanted to google for this post I'd look for "abstemi is not exactly the root of abstemiously"... very easy for the index to find if it's keeping the whole words, but hard to if not :).
You may have already seen it, but one project that's built on Porter's algorithm is the Snowball project. They've built a framework for implementing various algorithms easily (in a slightly strange language that is useful for the domain, I gather).
I've recently been writing some code to do automatic topic extraction and correlation from news articles and Snowball has been quite useful in that department. It's kind of fun to play with this stuff. Glyph's comment about human expectation seems valid from my experience, but for some automatic processing tasks, stemming is a useful tool in the arsenal.
Add a comment: