« | » Main « | »

Selecting randomly from an unknown sequence

Sunday 26 August 2012

A was reminded today of a technique I'd read about somewhere before: how to choose randomly from a sequence of unknown length. The typical use is to choose a random line from a file.

The naive way to do it is to read the whole file, count the number of lines, choose a random number in that range, and that's your line. But that requires either holding the whole file in memory, or reading the file twice. When I first heard of this problem, I didn't see how it could be done any other way. Surely you need to know how many you're choosing from to select uniformly?

It turns out you don't need to know the length of the sequence in advance, you can choose as you go so that the end result is uniform. The cleverer way to do it is to choose a line with a decreasing probability as the sequence goes along, so that at any time, you have an element from the sequence with a uniform probability:

import random

def random_element(seq):
    """Return an element chosen at random from `seq`."""
    it = None
    for n, elem in enumerate(seq):
        if random.randint(0, n) == 0:
            it = elem
    return it

Note that due to Python's duck-typing, using this function on an open file will return a random line from the file.

To see why this function works, consider it inductively. For the initial case, we'll always pick the first element as "it", since a random int between 0 and 0 will be 0. So after one element, we've chosen uniformly from the one element we've seen.

For the inductive case, imagine we've read N-1 elements already, and "it" is a random element chosen uniformly from those N-1 elements. The chance that the Nth element should be chosen instead is 1/N, because at this point, the chance that any particular line should be "it" is 1/N. So we choose a random number, and if it's the 1/N outcome, we take the Nth line as "it." In the case that we don't take the new line, which is (N-1)/N, the old line is kept. It was chosen randomly from the N-1 lines that preceded this one, so each line has a final chance of ((N-1)/N)/(N-1), or 1/N, or being chosen. Therefore the new line and each of the old lines is equally likely, so the distribution is uniform.

Since the initial case meets our uniform-selection criterion, and we've shown that if N-1 satisfies it then N satisfies it, then by induction, our selection will be uniformly random for any N.

The original question had to do with picking 10 random lines, so how do we generalize to a larger selection than one? We keep a list of K elements we've chosen, and only if the next one meets the appropriately decreasing probability over time, do we add it to our collection, bumping an old value randomly:

def random_elements(seq, k):
    """Return `k` elements chosen randomly from `seq`."""
    them = []
    for n, elem in enumerate(seq):
        if len(them) < k:
            them.append(elem)
        else:
            if random.randint(0, n) < k:
                them[random.randint(0, k-1)] = elem
    return them

Now the Nth element has a k/N chance of being in the result, so we modify our selection criteria, and if it's chosen, we choose an old value to replace.

I can't say I've ever needed these functions, but I like their elegance and their surprising possibility. There's something very pleasing about a simple algorithm that runs counter to intuition.

Note: Blckknight's answer has a slightly nicer implementation of the random_elements function.

Fixing broken Unicode

Tuesday 21 August 2012

In Pragmatic Unicode, or, How Do I Stop the Pain?, I said that you have to know the encoding of your bytes, so that you can properly decode them to Unicode. I also said the world is a really messy place, and that the declarations that should tell you the encoding are sometimes wrong.

It gets even worse than that: your bytes may have been incorrectly handled by an upstream component, so that it isn't a valid sequence of bytes at all. We've all seen web pages with A-hats on them:

If numbers aren’t beautiful, I don’t know what is. –Paul Erdős

Rob Speer deals with data like this at his day job at Luminoso, and decided to do something about it. His blog post Fixing common Unicode mistakes with Python — after they’ve been made explains his function fix_bad_unicode(text), which detects common mistakes and fixes them with a handful of real-world heuristics.

This is the kind of code I'm not sure I would have attempted, given the "impossibility" of the task. Bravo to Rob for taking it on.

Paul Ryan

Saturday 11 August 2012

Mitt Romney has chosen Paul Ryan as a running mate, and I think it is an excellent choice. I say this not because I ever want either of them to lead the country, but because it provides a clear distinction between the parties.

Ryan is to be commended for proposing his alternative budget. He believes that you should not just oppose things you think are wrong, but instead, you should provide alternatives. His budget is an alternative, and a stark one. I hope that his candidacy will bring his budget to the fore, because I think it is a great demonstration of what the Republican party wants.

Ryan's budget was written to address the problem of the deficit, a problem I believe is a real one. My issue with Ryan's budget is how he proposes to solve the deficit problem, and I wish people better understood his solution.

Ryan's answer to the deficit is to ask more of those with the least, when we could be asking more of those with the most.

To be clear: you can solve a deficit in one of two ways, and the deficit doesn't care which you choose: you can cut spending, or you can raise revenue. Those that blare, "it's the spending that's the problem!" are being disingenuous. The deficit is the difference between two numbers, and any way you bring those numbers closer together will help.

Only if you also have a second (or actually, first) agenda of reducing the size of government do you think it matters which way you reduce the deficit.

The only way to ask more of those with the most is to increase taxes. Conservatives claim that those with the most create jobs, but mostly what they do is invest, and investing is not what really creates jobs: spending creates jobs. The way to get more spending is to help those with the least have enough money to spend. Letting those with the most have more will not get them to spend, they already have enough to spend.

So let's welcome Paul Ryan to the presidential campaign, and let's hope that his ideas get a wider airing. I think they are bad ideas, and I hope more than half the voters in November think so too.

« | » Main « | »