Filter a list into two parts

Tuesday 11 June 2013This is ten years old. Be careful.

A recent discussion on comp.lang.python asked how to Split a list into two parts based on a filter?, and the answers were interesting and insightful.

The question was by Roy Smith. He asked about dividing a list of songs into two lists based on a predicate, like this:

new_songs = [s for s in songs if s.is_new()]
old_songs = [s for s in songs if not s.is_new()]

but without iterating the list twice and calling the predicate twice for each element. He also provided this implementation, which iterates the list only once:

new_songs = []
old_songs = []
for s in songs:
    if s.is_new():
        new_songs.append(s)
    else:
        old_songs.append(s)

which got shortened to:

new_songs, old_songs = [], []
for s in songs:
    (new_songs if s.is_new() else old_songs).append(s)

This works just fine for a simple list, but suppose the original sequence to be divided was a generator, and you don’t want to buffer all the values in memory? Is there a way to do this lazily?

Usually, we go to the itertools module for things like this, but this problem is tricky: we need to get two lazy iterators at once.

Chris Angelico came up with this remarkable machine, with a deque tweak by Tim Chase:

def iterpartition(pred, it):
    """Partition an iterable based on a predicate.

    Returns two iterables, for those with pred False and those True.
    """
    falses, trues = collections.deque(), collections.deque()
    it = iter(it)
    def get_false():
        while True:
            if falses:
                yield falses.popleft()
            else:
                while True:
                    val = next(it)
                    if pred(val):
                        trues.append(val)
                    else:
                        break
                yield val
    def get_true():
        while True:
            if trues:
                yield trues.popleft()
            else:
                while True:
                    val = next(it)
                    if not pred(val):
                        falses.append(val)
                    else:
                        break
                yield val
    return get_false(), get_true()

Here two generators are created as closures, and returned as the result of the function. The generators cooperate to work over the original iterable, calling the predicate only once for each element, and buffering only enough values to find the next one you want.

Peter Otten contributed this gem:

def partition(items, predicate=bool):
    a, b = itertools.tee((predicate(item), item) for item in items)
    return ((item for pred, item in a if not pred),
            (item for pred, item in b if pred))

Here we finally have itertools helping us. This behaves like Chris’ code, buffering enough values for either returned iterable to provide the next requested value. But we don’t have to do the buffering, itertools.tee does it for us, and the predicate is called only once for each element in the iterable. Itertools.tee gets a generator expression that produces the original elements along with the result of calling the predicate. That is tee’d into two iterables, which are then fed to two generator expressions to select the proper values. Clever.

When I first read the question on the list, I didn’t see how to improve on the list-building code. Now I’ve got two interesting implementations, and I learned a few things!

Comments

[gravatar]
I've always done this ugly when operating on unbound generators.

test = predicate(data)
If test:
yield (data, None)
else:
yield (None, data)
[gravatar]
Nice! Should this be added to the itertools recipes on the python documentation?
[gravatar]
You make it more itertoolsy by following an example recipe from the itertools docs:

http://docs.python.org/dev/library/itertools.html#itertools-recipes

It looks like this, which I think is also more readable:
def partition(pred, iterable):
    t1, t2 = itertools.tee(iterable)
    return itertools.filterfalse(pred, t1), itertools.filter(pred, t2)
[gravatar]
Will Hardy, that evaluates the predicate twice for each item: “but without … calling the predicate twice for each element”
[gravatar]
Piotr Dobrogost 5:55 PM on 12 Jun 2013
Am I missing something here or is this the exemplary case when you should not use tee() according to the following statement from the docs "In general, if one iterator uses most or all of the data before another iterator starts, it is faster to use list() instead of tee()"?
[gravatar]
I wrote something similar a few months ago in response to a question in #python: https://gist.github.com/Julian/5230776 (and coincidentally yesterday it came up again).
[gravatar]
@Piotr: itertools.tee will buffer the data that has been returned from one iterator, but not yet returned from the other. In other words, it will buffer all the data between the current positions on each iterator. So whether or not you should use it depends a lot on how you will pull data from the two iterators.

I haven't touched at all here on how different usages might skew the preference for one solution over another.
[gravatar]
This problem and variations on it were my motivation for creating sendtools (https://pypi.python.org/pypi/sendtools).

BC
[gravatar]
To test my understanding of the tee solution above, and also because I didn't like the use of a and b in the partition function I decided to generalize the function.

The function partitionn below will partition items under the same restrictions as mentioned but into N parts given a predicate that returns an integer 0 .. n-1 when acting on any item.
>>> import itertools
>>> 
>>> def partitionn(items, predicate=int, n=2):
    tees = itertools.tee( ((predicate(item), item)
                          for item in items), n )
    return ( (lambda i:(item for pred, item in tees[i] if pred==i))(x)
              for x in range(n) )

>>> partitions = partitionn(items=range(15), predicate=lambda x: x % 2, n=2)
>>> for p in partitions: print(p, list(p))

[generator object [genexpr] at 0x02D3C7D8] [0, 2, 4, 6, 8, 10, 12, 14]
[generator object [genexpr] at 0x02D3C8F0] [1, 3, 5, 7, 9, 11, 13]
>>> 
>>> partitions = partitionn(items=range(15), predicate=lambda x: x % 4, n=4)
>>> for p in partitions: print(p, list(p))

[generator object [genexpr] at 0x02D3C940] [0, 4, 8, 12]
[generator object [genexpr] at 0x02D3C850] [1, 5, 9, 13]
[generator object [genexpr] at 0x02DCAEE0] [2, 6, 10, 14]
[generator object [genexpr] at 0x02DCAF80] [3, 7, 11]
>>>
(Note that in the printed output of generator objects I changed the angle brackets to square brackets so it would not be consumed by the comment editor).
[gravatar]
usually i filter it once with a list comprehension and then make a set out of both lists, usings sets you can easily get the remaining elements out.
[gravatar]
I asked this same question on stackoverflow a long time ago:

http://stackoverflow.com/questions/1359232/generate-two-lists-at-once

This was the best answer I got:
aprime, bprime = zip(*[(a1,b1) for a1,b1,c1 in zip(a,b,c) if c1==0])
[gravatar]
Following on from this there is Filtering an iterator into N parts lazily where I expand on the theme. (Thanks Ned - great subject).
[gravatar]
words = ["foo", "bar", "baz", "spam", "eggs", "cheese"]

those_words = [w for w in words if w in ("foo", "bar")]
other_words = list(set(words) - set(those_words))
[gravatar]
Wow, Peter Otten's "gem" is truly elegant. That's certainly going to be making an appearance in my projects.

Add a comment:

Ignore this:
Leave this empty:
Name is required. Either email or web are required. Email won't be displayed and I won't spam you. Your web site won't be indexed by search engines.
Don't put anything here:
Leave this empty:
Comment text is Markdown.