Sunday 26 August 2012
I was reminded today of a technique I’d read about somewhere before: how to choose randomly from a sequence of unknown length. The typical use is to choose a random line from a file.
The naive way to do it is to read the whole file, count the number of lines, choose a random number in that range, and that’s your line. But that requires either holding the whole file in memory, or reading the file twice. When I first heard of this problem, I didn’t see how it could be done any other way. Surely you need to know how many you’re choosing from to select uniformly?
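For concreteness, here’s a minimal sketch of that naive approach (the function name and path are just for illustration):

import random

def naive_random_line(path):
    """Read the whole file into memory, then pick one line uniformly."""
    with open(path) as f:
        lines = f.readlines()       # the entire file is held in memory
    return random.choice(lines)     # uniform choice once we know the count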
It turns out you don’t need to know the length of the sequence in advance; you can choose as you go so that the end result is uniform. The cleverer way to do it is to choose a line with a decreasing probability as the sequence goes along, so that at any time, you have an element chosen from the sequence with uniform probability:
import random

def random_element(seq):
    """Return an element chosen at random from `seq`."""
    it = None
    for n, elem in enumerate(seq):
        # The nth element (counting from 0) replaces `it` with probability 1/(n+1).
        if random.randint(0, n) == 0:
            it = elem
    return it
Note that due to Python’s duck-typing, using this function on an open file will return a random line from the file.
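For example (the filename here is just a placeholder):

with open("lines.txt") as f:
    line = random_element(f)    # each line of the file is equally likely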
To see why this function works, consider it inductively. For the initial case, we’ll always pick the first element as “it”, since a random int between 0 and 0 will be 0. So after one element, we’ve chosen uniformly from the one element we’ve seen.
For the inductive case, imagine we’ve read N-1 elements already, and “it” is an element chosen uniformly from those N-1 elements. The chance that the Nth element should be chosen instead is 1/N, because at this point, the chance that any particular line should be “it” is 1/N. So we choose a random number, and if it’s the 1/N outcome, we take the Nth line as “it.” In the case that we don’t take the new line, which happens with probability (N-1)/N, the old line is kept. It was chosen uniformly from the N-1 lines that preceded this one, so each of those lines has a final chance of ((N-1)/N) × (1/(N-1)), or 1/N, of being chosen. Therefore the new line and each of the old lines is equally likely, so the distribution is uniform.
Since the initial case meets our uniform-selection criterion, and we’ve shown that if it holds for N-1 elements then it holds for N, by induction our selection will be uniformly random for any N.
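If you’d rather convince yourself empirically, a quick simulation (my addition, not part of the original argument) makes the same point:

from collections import Counter

counts = Counter(random_element(range(5)) for _ in range(100000))
print(counts)   # each of 0..4 should appear roughly 20,000 times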
The original question had to do with picking 10 random lines, so how do we generalize to a larger selection than one? We keep a list of k elements we’ve chosen so far, and add the next element to our collection only if it passes an appropriately decreasing probability test, bumping a randomly chosen old value:
def random_elements(seq, k):
    """Return `k` elements chosen randomly from `seq`."""
    them = []
    for n, elem in enumerate(seq):
        if len(them) < k:
            them.append(elem)
        else:
            # The nth element (counting from 0) enters the sample with probability k/(n+1).
            if random.randint(0, n) < k:
                them[random.randint(0, k-1)] = elem
    return them
Now the Nth element has a k/N chance of being in the result, so we modify our selection criterion accordingly, and if the new element is chosen, we pick an old value for it to replace.
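As a quick illustrative check (my example, not from the original question):

sample = random_elements(range(1000), 10)
print(sample)   # 10 elements; any particular element lands in the sample with probability 10/1000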
I can’t say I’ve ever needed these functions, but I like their elegance and their surprising possibility. There’s something very pleasing about a simple algorithm that runs counter to intuition.
Note: Blckknight’s answer has a slightly nicer implementation of the random_elements function.
Comments
The 0th item has a (1-1/2) = 1/2 probability of making it through N=2, and (1-1/3) = 2/3 probability of making it through N=3. So, the probability of both occurring is 1/2 * 2/3 = 1/3.
So:
for N=3: 1/2 * (1-1/3) = 1/2 * 2/3 = 1/3
for N=4: 1/2 * (1-1/3) * (1-1/4) = 1/2 * 2/3 * 3/4 = 1/4
...
The second randint, i.e. "random.randint(0, k-1)", is not necessary (and in fact not in Blckknight's version)
http://en.wikipedia.org/wiki/Fisher%E2%80%93Yates_shuffle#The_.22inside-out.22_algorithm
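If I’m reading that suggestion right, the single random index can serve both as the acceptance test and as the slot to replace, something like this sketch (my interpretation, not necessarily Blckknight's exact code):

def random_elements(seq, k):
    """Return `k` elements chosen randomly from `seq`, using one randint per element."""
    them = []
    for n, elem in enumerate(seq):
        if len(them) < k:
            them.append(elem)
        else:
            j = random.randint(0, n)   # one draw: accept iff j < k, and j is also the slot to replace
            if j < k:
                them[j] = elem
    return them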
I can actually see a lot of use cases, if you consider that it helps not only when the total number of elements is unknown, but also when counting them or querying all of them is prohibitively expensive.
Consider a platform like Google App Engine, where every Datastore query costs money and results are capped at a low maximum of only 1000. If you had a forum and wanted to display a random comment from each thread on a listing page, this would be a way to compute the random comment as each comment is made, never performing an additional query.
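One way to do that incrementally (a sketch of the idea only, not actual App Engine code) is to store, per thread, a comment count and the currently chosen comment, and update both whenever a new comment arrives:

import random

def update_random_comment(count, chosen, new_comment):
    """Maintain a uniformly random choice as comments arrive one at a time.

    `count` is the number of comments the thread had before this one;
    `chosen` is the previously selected comment (None for an empty thread).
    """
    count += 1
    if random.randint(1, count) == 1:   # the new comment wins with probability 1/count
        chosen = new_comment
    return count, chosen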
Like Simon, I learned about it via the Cookbook.
Also, what happens if you generate just one random floating-point number in [0..1) and stretch & reuse it at each step? Do you get a non-uniform distribution? And if so, why?
But I don't understand what you mean about stretching and reusing a single random number. You need a uniformly distributed random number generated at each step in order to make sure that the result is likewise a uniformly distributed random selection. If it's biased, the result will be biased.
If by stretching the random number you mean using more of its bits, it only has 54 or so bits of randomness, since it's a float. So there's only so much you can do with one random number.