Iterator merge in Python

Tuesday 3 October 2006

Although I've been working in Python for a long time, every once in a while I come across a chunk of code that makes me stop and think, and admire its elegance and Pythonic nature. Raymond Hettinger's recipe for merging multiple sorted iterators is one of those pieces of code. It's only 29 lines including the supporting stuff, but it opened my eyes in these ways:

  • It works with iterators in unusual ways, including taking a list of objects, turning them into iterators by mapping iter over them, and then iterating over the list of iterators.
  • It manipulates the iterators directly, by calling the next() function explicitly. Usually I'd think about the implementation of iterators involving the next() method only when considering writing an unusual iterator, not when using an iterator.
  • It uses a standard module that had escaped my attention, heapq for using a list as a sorted heap.
  • It uses object methods as first-class objects that can be pushed into the heap along with the data to be sorted.

There's nothing like a succinct thought-provoking snippet to start your day.

tagged: » 6 reactions

Comments

[gravatar]
Calvin Spealman 5:17 AM on 4 Oct 2006

Not impressed with this one. Far too much work for such a triffle thing. It does too much of the grunt work when Python already includes most of the bells and whistles. Simplicity trumps all else.

def merge(*iters):
    iters = list(iter(i) for i in iters)
    while iters:
        for i in iters:
            try:
                yield i.next()
            except StopIteration, e:
                iters.remove(i)

print sorted(merge([1,2,3,5], [4,5]))

[gravatar]
Ned Batchelder 7:39 AM on 4 Oct 2006

Calvin, yours is shorter, but misses an important point in Raymond's: yours pulls the entire contents of the iterators into memory, while his does not. Imagine trying to merge 5 sorted 2Gb files together. His will use very little memory (only 5 lines of text need to be stored at once), while yours will consume 10Gb.

[gravatar]
Ned Batchelder 7:43 AM on 4 Oct 2006

One more thing: Calvin's version of course has use where memory use is not a concern, and can be simplified further, using only standard modules:

from itertools import chain
print sorted(chain([1,2,3,4], [4,5]))

[gravatar]
Nicolas Lehuen 5:49 PM on 5 Oct 2006

I don't want to brag or anything, but I wrote something very close to Raymond a few day before he entered his version in the cookbook :

http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/466302

It include all the nifty stuff about using heapq and calling next() on iterators. Well, yeah, I guess I'm bragging :) At least Raymond's version is generic and not mixed with file management.

[gravatar]
Calvin Spealman 5:59 PM on 5 Oct 2006

How embarrasing! I misread the recipe originally. I did not notice the nifty trick of pushing both the value and the next function, and somehow interpreted it as pushing all values into the heap immediately. This is why I mistakenly saw nothing in the recipe that even cared or took advantage of the iterables being pre-sorted, and thus shot it down.

I conceed my disagreement.

[gravatar]
Graham 1:47 PM on 6 Oct 2006

Neat. It's worth noting the similarity of the (val, next-method) approach to the "streams" model in SICP which is fascinating (required?) reading.

Add a comment:

name
email
Ignore this:
not displayed and no spam.
Leave this empty:
www
not searched.
 
Name and either email or www are required.
Don't put anything here:
Leave this empty:
URLs auto-link and some tags are allowed: <a><b><i><p><br><pre>.