A Bloom filter is a clever data structure that lets you check if an element is a member of a set, while storing much less data than the entire set. The tradeoff is that there are false positives: you might ask if a member is in the set, and get the answer “yes”, when it should have been “no”.
Li Fan provides a thorough overview of the technique in his paper about a Summary Cache that uses the technique, but here it is in a nutshell:
The set is stored as a large bit array (say 220 bits). To add an element to the set, you hash it a few different ways to produce a handful of different 20-bit hashes. Each hash value is an index into the bit array. Set each indexed bit to 1. Some of these bits will already be 1 (from previous entries).
To test if an element is in the set, hash it with all the hash functions, and check the indexed bits. If any of them is 0, then the element is not in the set. If all of them are 1, then it probably is in the set. They are probably all 1 because the element had been hashed into the set before, but they could be all 1 because of a collection of other entries which happened to each set some of the bits. This is the false positive. Adjusting the number of bits in the array and the number of hashes changes the probability that a false positive will occur.
The Fan paper has more about it all, including tables of false positive probabilities. Another source is Some Motley Bloom Tricks.
An aside about discovery: this is one of those web items that would be hard to credit properly with a simple “via” reference. The other day, Bob Congdon linked to the Dictionary of Algorithms and Data Structures. I poked around a bit, and glanced at their page about Bloom filters (which doesn’t say much) before eventually finding the subject of yesterday’s post about exact string matching.
Then today, I was trying to parse Mark Pilgrim’s rant about TypeKey (I think he’s against it, I can’t tell. Sarcasm doesn’t always work), when I saw his reference to LOAF (not the joke LOAF, this one seems to be a real proposal). The LOAF proposal is a way to securely send your address book with email messages so that anyone can check if an address is in your book, but can’t actually read your book. They do it with a Bloom filter, natch.
That made two references in two days to something I’d never heard of before, so I had to look into it more.
Then I wondered about Burton Bloom, the inventor of the technique. The citation on his CACM paper listed him as working at Computer Usage Company, in nearby Newton Upper Falls, and there seems to be little else about him on the web. This excursion into Bloom filters as mimics of human memory credits him with AI Memo 47 about chess.
A search on Computer Usage Company (which I had also never heard of) turns up a claim that it was the first software company, and that it was founded in 1955 by Elmer C. Kubie. His name struck me, since I work for Kubi Software, no relation.
In a final coincidence, Elmer Kubie died just two weeks ago in Ithaca.