Selecting randomly from an unknown sequence

Sunday 26 August 2012

I was reminded today of a technique I’d read about somewhere before: how to choose randomly from a sequence of unknown length. The typical use is to choose a random line from a file.

The naive way to do it is to read the whole file, count the number of lines, choose a random number in that range, and that’s your line. But that requires either holding the whole file in memory, or reading the file twice. When I first heard of this problem, I didn’t see how it could be done any other way. Surely you need to know how many you’re choosing from to select uniformly?
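For concreteness, here’s a minimal sketch of that naive approach (the file name is just a placeholder):

import random

def random_line_naive(path):
    """Read the whole file into memory, then pick a line."""
    with open(path) as f:
        lines = f.readlines()
    return random.choice(lines)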

It turns out you don’t need to know the length of the sequence in advance: you can choose as you go, and the end result is still uniform. The cleverer way to do it is to take each new element with a decreasing probability as the sequence goes along, so that at any point you’re holding an element chosen uniformly from the elements seen so far:

import random

def random_element(seq):
    """Return an element chosen at random from `seq`."""
    it = None
    for n, elem in enumerate(seq):
        # Take the new element with probability 1/(n+1): the first
        # element is always taken, the second half the time, the
        # third a third of the time, and so on.
        if random.randint(0, n) == 0:
            it = elem
    return it

Note that due to Python’s duck-typing, using this function on an open file will return a random line from the file.
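For example, to pick a random line from a file (the file name here is just a placeholder):

with open("lines.txt") as f:
    line = random_element(f)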

To see why this function works, consider it inductively. For the initial case, we’ll always pick the first element as “it”, since a random int between 0 and 0 will be 0. So after one element, we’ve chosen uniformly from the one element we’ve seen.

For the inductive case, imagine we’ve read N-1 elements already, and “it” is an element chosen uniformly from those N-1 elements. The chance that the Nth element should be chosen instead is 1/N, because at this point, the chance that any particular line should be “it” is 1/N. So we choose a random number, and if it’s the 1/N outcome, we take the Nth line as “it.” If we don’t take the new line, which happens with probability (N-1)/N, the old line is kept. It was chosen uniformly from the N-1 lines that preceded this one, so each of those lines has a final chance of (1/(N-1)) * ((N-1)/N), or 1/N, of being chosen. Therefore the new line and each of the old lines are equally likely, so the distribution is uniform.

Since the initial case meets our uniform-selection criterion, and we’ve shown that if N-1 satisfies it then N satisfies it, then by induction, our selection will be uniformly random for any N.
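If you’d rather check than prove, a quick simulation (a sketch; the exact counts will vary from run to run) shows the choices coming out roughly equal:

from collections import Counter

counts = Counter(random_element(range(5)) for _ in range(100000))
print(counts)  # each of 0..4 should appear roughly 20,000 times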

The original question had to do with picking 10 random lines, so how do we generalize from selecting one element to selecting K? We keep a list of the K elements chosen so far, and add each new element only with an appropriately decreasing probability, bumping a randomly chosen old value when we do:

def random_elements(seq, k):
    """Return `k` elements chosen randomly from `seq`."""
    them = []
    for n, elem in enumerate(seq):
        if len(them) < k:
            # Take the first k elements to fill the selection.
            them.append(elem)
        elif random.randint(0, n) < k:
            # Keep the new element with probability k/(n+1),
            # replacing a randomly chosen old element.
            them[random.randint(0, k-1)] = elem
    return them

Now the Nth element has a k/N chance of being in the result, so we modify our selection criterion accordingly, and when a new element is chosen, we pick an old value at random for it to replace.
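The same kind of sanity check works for the k-element version (again, a sketch; the counts will vary):

from collections import Counter

counts = Counter()
for _ in range(100000):
    counts.update(random_elements(range(10), 3))
print(counts)  # each of 0..9 should appear roughly 30,000 times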

I can’t say I’ve ever needed these functions, but I like their elegance, and the surprising fact that they’re possible at all. There’s something very pleasing about a simple algorithm that runs counter to intuition.

Note: Blckknight’s answer has a slightly nicer implementation of the random_elements function.
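I haven’t reproduced that answer here, but judging from the comment below that the second randint is unnecessary, the variant presumably reuses the one random draw to name the slot to replace. A sketch of what that might look like (my reconstruction, not Blckknight’s actual code):

def random_elements_variant(seq, k):
    """Like random_elements, but reusing the single random draw:
    if the index chosen from range(n+1) falls inside the current
    selection, it also says which slot to replace."""
    them = []
    for n, elem in enumerate(seq):
        if n < k:
            them.append(elem)
        else:
            j = random.randint(0, n)
            if j < k:
                them[j] = elem
    return them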

Comments

That's neat! It didn't quite click for me until I confirmed that the distribution actually is uniform by calculating the probability of keeping the previously selected item. For example:

The 0th item has a (1-1/2) = 1/2 probability of making it through N=2, and a (1-1/3) = 2/3 probability of making it through N=3. So the probability of both occurring is 1/2 * 2/3 = 1/3.

So:
for N=3: 1/2 * (1-1/3) = 1/2 * 2/3 = 1/3
for N=4: 1/2 * (1-1/3) * (1-1/4) = 1/2 * 2/3 * 3/4 = 1/4
...

Raymond Hettinger showed me how to do this some years ago over on the Python Cookbook.

I didn't see this mentioned anywhere; apologies if I missed it. This algorithm is a variant of the Fisher-Yates shuffle.

The second randint, i.e. "random.randint(0, k-1)", is not necessary (and in fact is not in Blckknight's version):

http://en.wikipedia.org/wiki/Fisher%E2%80%93Yates_shuffle#The_.22inside-out.22_algorithm

A particularly gnarly interviewer at our company used to ask this question of programming candidates, and I've seen it elsewhere as a suggested interview question, too. Once I stepped through it, I got it... but it doesn't seem like a terribly useful trick (nor a terribly useful interview question, to be honest). I spend most of my days trying to get my programs *not* to do random things :)

The first function only gives the result 0. Tested like this:
print [random_element(range(100)) for _ in range(100)]

@Andreas, not sure what you did wrong, but when I use the code from the blog post with your snippet, I get a nice random list of numbers.

Never mind, the indentation of the return statement got mangled in copy-pasting. And a histogram indeed confirms that the distribution is uniform.

I think you are missing an important bit in your proof: each of the (N-1) elements has a probability of 1/(N-1) of being "it" so far, times the probability of 1 - (1/N) of not being replaced, which makes 1/N. Your proof only shows (trivially) that the N-th has a probability of 1/N, unless I misunderstand.

@Florian, you are right, I've added more to the proof.

That's indirectly why I commented, too. :) I hope it's complementary.

There is a more efficient way than testing each element for inclusion. Each time an element is added, the distribution over all futures can be used to determine the next one to include, and then you can wait for that one or just jump ahead. This is a better method, requiring much less computation. I think Vitter wrote the paper on this.
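That sounds like Vitter's work on reservoir sampling (his "Random Sampling with a Reservoir" paper). For the single-element case, the jump-ahead idea can be sketched as follows; this is my own illustration, not code from the comment or the paper. The element at position n survives past position m with probability n/m, so the next accepted position can be sampled directly instead of testing every element:

import math
import random

def random_element_skipping(seq):
    """Choose one element uniformly, drawing one random number
    per accepted element instead of one per element seen."""
    it = None
    next_pick = 1  # 1-based position of the next element to take
    for n, elem in enumerate(seq, start=1):
        if n == next_pick:
            it = elem
            # Sample the next accepted position N so that
            # P(N > m) = n/m, i.e. N = ceil(n/u) for uniform u.
            u = random.random() or 0.5  # guard against u == 0.0
            next_pick = int(math.ceil(n / u))
    return it
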
This is a fantastic little trick!

I can actually see a lot of use cases, if you consider that it helps not only when the total number of elements is unknown, but also when counting them, or querying all of them, is prohibitively expensive.

Consider a platform like Google App Engine, where every Datastore query has a cost and a low maximum of only 1000 results. If you had a forum and wanted to display a random comment from each thread on a listing page, this would be a way to calculate the random comment as each comment is made, never performing an additional query.
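To make that concrete, here's a hypothetical sketch of the update step (the function and its arguments are invented for illustration): store one chosen comment and a running count per thread, and update them as each comment is posted.

import random

def updated_choice(current_choice, new_comment, count):
    """Called when the count-th comment (1-based) is posted:
    keep the new comment with probability 1/count."""
    if random.randint(1, count) == 1:
        return new_comment
    return current_choice
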
IGV (http://www.broadinstitute.org/igv/) uses this algorithm for loading alignments (sequencing results from DNA, with some experimental info, after they have been aligned to a location on the genome). We used to use a buffering system, whereby one would load many alignments into memory and then downsample, but that had a tendency to run out of memory. It's Java, not Python, but the algorithm is the same.

This is one of my favorite "algorithmic one-liners"!

Like Simon, I learned about it via the Cookbook.

Late to the party, but ... this neat little algorithm only makes sense if, for a list of length n, producing n random numbers is computationally less expensive than walking through the list twice (well, at least once, at most twice, expected value 1.5 times (?)). Or not?

Also, what happens if you generate just one random floating-point number [0..1) and stretch & reuse it at each step? Do you get a non-uniform distribution? And if so, why?

@Chris Green: I think you're right -- it only makes sense if seeking to a particular line is more expensive than generating n random numbers.

But I don't understand what you mean about stretching and reusing a single random number. You need a uniformly distributed random number generated at each step in order to make sure that the result is likewise a uniformly distributed random selection. If it's biased, the result will be biased.

@Chris: you are right, if computing random numbers is slower than reading the sequence, this is a silly thing to do.

If by stretching the random number you mean using more of its bits, it only has 54 or so bits of randomness, since it's a float. So there's only so much you can do with one random number.

Ooh, I remember this fondly from the Perl4 Camel book, actually.
