Enumerating trees

Monday 11 February 2008

I was hacking around with a program that needed tree-structured data, and I wanted to generate all possible trees of a certain size. I searched around and found nothing useful. After a few experiments involving Python generators, postfix expressions, and recursive trees, I've got some code to enumerate binary trees.

I started with postfix expressions because they are easy to work with, and are equivalent to trees, in that trees are like infix notation expressions, and they are equivalent to postfix. After some puzzling on it, I saw the recursive light, and came up with this to generate all possible postfix expressions:

# Enumerate postfix expressions

def _exprs(expr, stack_depth, vals, ops):
    """ Generate postfix expressions recursively.
        `expr` is the expression so far, a list of values and operators.
        `stack_depth` is the depth of the stack created by expr.
        `vals` is a list of values remaining to add to the expression.
        `ops` is a list of possible operators to choose from.
    if stack_depth == 1 and not vals:
        # This is a valid expression.
        yield expr
    if stack_depth > 1:
        # Try using an operator
        for o in ops:
            for e in _exprs(expr+[o], stack_depth-1, vals, ops):
                yield e
    if vals:
        # Vals are available, push one on the stack
        for e in _exprs(expr+[vals[0]], stack_depth+1, vals[1:], ops):
            yield e

def exprs(vals, ops):
    """ Generate postfix expressions created from `vals`, the list of values
        to use, and `ops`, the possible operators to combine them with.
    return _exprs([], 0, vals, ops)

def all_exprs(n, ops='+'):
    return [ " ".join(e) for e in exprs(vals, ops) ]

print "\n".join(all_exprs(4, '+*'))

for i in range(2, 15):
    ndyck = len(all_exprs(i))
    print "%2d%7d expressions" % (i, ndyck)

The nice thing here is the generator is a pretty clear expression of the enumeration technique: as you are building an expression:

  • if the stack has one item, and there are no more values to squeeze in, you have a valid expression.
  • if the stack is two or more deep, then you can add to expression by using an operator to combine those top two values, and the stack is now one shallower.
  • if you have more values to use, you can append the next one to the expression, and the stack is now one deeper.

The nature of recursive generators requires the for/yield structure:

# When recurring in generators, you have to loop to pull the values,
# and yield each one:
for v in _recursive_generator(blah):
    yield v
# Wouldn't it be cool if you could do this instead:
yield _recursive_generator(blah)

# but that would yield a generator, a style of recursion I can't quite wrap
# my head around!

Running the code shows all the expressions of two operators on five values, and the counts of expressions of one operator over varying numbers of values:

A B + C + D +
A B + C + D *
A B + C * D +
A B + C * D *
A B + C D + +
A B + C D + *
A B + C D * +
A B + C D * *
A B * C + D +
A B * C + D *
A B * C * D +
A B * C * D *
A B * C D + +
A B * C D + *
A B * C D * +
A B * C D * *
A B C + + D +
A B C + + D *
A B C + * D +
A B C + * D *
A B C + D + +
A B C + D + *
A B C + D * +
A B C + D * *
A B C * + D +
A B C * + D *
A B C * * D +
A B C * * D *
A B C * D + +
A B C * D + *
A B C * D * +
A B C * D * *
A B C D + + +
A B C D + + *
A B C D + * +
A B C D + * *
A B C D * + +
A B C D * + *
A B C D * * +
A B C D * * *
 2:       1 expressions
 3:       2 expressions
 4:       5 expressions
 5:      14 expressions
 6:      42 expressions
 7:     132 expressions
 8:     429 expressions
 9:    1430 expressions
10:    4862 expressions
11:   16796 expressions
12:   58786 expressions
13:  208012 expressions
14:  742900 expressions

I still wanted to find other work on this stuff, and I figured anyone who had successfully enumerated trees would also have done a count like I did, so I did a search on the last two counts (208012 742900). I was directed to a few interesting pages, in particular the wikipedia page on Catalan numbers, which confirmed that my count of trees is correct, and led to the theory of Dyck languages, which is too much math for my purposes, and led to a few interesting papers that I couldn't understand.

To make trees, my first impulse was to use my expression generator and then execute the expressions to construct trees, but that seemed unnecessary, so I adapted the expression code to create trees directly:

# Enumerate binary trees

class Leaf:
    def __init__(self, val):
        self.val = val
    def __str__(self):
        return str(self.val)

class Node:
    def __init__(self, op, left, right):
        self.op = op
        self.left, self.right = left, right

    def __str__(self):
        return "(%s %s %s)" % (self.left, self.op, self.right)
def _trees(stack, vals, ops):
    if len(stack) == 1 and not vals:
        yield stack[0]
    if len(stack) > 1:
        for op in ops:
            left = stack[-2]
            right = stack[-1]
            if hasattr(right, 'op') and right.op == op:
                # Avoid duplicates: our operations are associative,
                # so we can skip trees like (A+(B+C)), since we'll
                # get ((A+B)+C) instead on another iteration.
            new_node = Node(op, left, right)
            new_stack = stack[:-2]
            for t in _trees(new_stack, vals, ops):
                yield t
    if vals:
        for t in _trees(stack, vals[1:], ops):
            yield t
def trees(vals, ops="+"):
    return _trees([], vals, ops)

def all_trees(n, ops='+'):
    return trees(vals, ops)

print "\n".join(map(str, all_trees(4, '+*')))

Which produces:

(((A + B) + C) + D)
(((A + B) + C) * D)
(((A + B) * C) + D)
(((A + B) * C) * D)
((A + B) * (C + D))
((A + B) + (C * D))
(((A * B) + C) + D)
(((A * B) + C) * D)
(((A * B) * C) + D)
(((A * B) * C) * D)
((A * B) * (C + D))
((A * B) + (C * D))
((A * (B + C)) + D)
((A * (B + C)) * D)
(A * ((B + C) + D))
(A + ((B + C) * D))
((A + (B * C)) + D)
((A + (B * C)) * D)
(A * ((B * C) + D))
(A + ((B * C) * D))
(A + (B * (C + D)))
(A * (B + (C * D)))

One tricky thing about this generator: it avoids trees which are structurally different, but with associate operators produces the same result, so it won't return both ((A+B)+C) and (A+(B+C)). If you want both of those trees, remove the check at line 25 for right associativity.

The number of trees grows very large as the number of leaves becomes big enough to be interesting. I guess I'll have to dig into application-specific ways to limit the choices.


andrew 10:30 AM on 11 Feb 2008

Is this hyper-geekiness because Damien is now some kind of über geek in *his* blog? Honestly, I wish that Damien would go back to talking about liquid poop and you would talk about video games and Legos (tm)

Seriously, recursion is the tool of the devil. I am pretty sure that you have found reason #2 for recursion to exist in programming languages (#1 being a gee-whiz Fibonacci calculator that we all did in CS 100).

All recursion ever did for me was blow the stack.

Michael Chermside 10:52 AM on 11 Feb 2008


Then you're missing out. There are LOTS of problems that are best solved with recursion. For instance, it isn't just enumerating trees... pretty much ANYTHING having to do with trees is best implemented using recursion. Blowing the stack with recursion isn't a problem if you're using it to walk a tree, really it's only a problem when you use recursion to do iteration (and then only in languages like Java or Python that don't handle recursion "properly", but that's a given).

lorg 1:39 PM on 11 Feb 2008

This could be useful for writing fuzzers (and unit-testing) for tree-like structures.
Not too long ago I wrote about a possible compiler fuzzer, so it would make sense to implement one with a similar technique.

Iain 2:59 PM on 11 Feb 2008

I always check Sloane's On-Line Encyclopedia of Integer Sequences for this sort of thing:

2, 5, 14, 42, 132, 429, 1430, 4862, 16796, 58786, 208012, 742900

Ned Batchelder 5:08 PM on 11 Feb 2008

@lorg: yes, I imagine there are lots of testing possibilities here.

@Iain: Sloane's looks a little barebones, but could be very useful.

susan senator 5:09 PM on 11 Feb 2008

Where have you been, other than Florida? Ned's always been Uber Geeky, regardless of Damien and his sofa.

But I'm with you on recursion, especially the eternal sort. So last century.

Bob Congdon 7:37 PM on 11 Feb 2008

Ned, cool stuff. I played with it in Iron Python for a bit.

By the way, if you're looking for real Über geeks, go rent The King Of Kong.

Ned Batchelder 8:18 PM on 11 Feb 2008

I got an email from Sanghyeon Seo pointing out that Donald Knuth is writing about this very topic in ACP vol 4, available as a fascicle: http://www-cs-faculty.stanford.edu/~knuth/fasc4a.ps

A quick look through there showed a few very dense algorithms to generate trees that seem much more efficient than mine, but I don't understand them!

Francisco 3:51 PM on 27 Jun 2011

Hi Ned, excellent work, your code has been really useful in my work.

I was wondering if it could be adapted to allow binary/unary trees. I mean trees with binary and unary operations.

Add a comment:

Ignore this:
not displayed and no spam.
Leave this empty:
not searched.
Name and either email or www are required.
Don't put anything here:
Leave this empty:
URLs auto-link and some tags are allowed: <a><b><i><p><br><pre>.