Line continuations from tokenize.generate_tokens

Thursday 24 September 2009

OK, this is really geeky, but I wish I had found it on the interwebs, so I’m putting it here for the next guy.

tokenize.generate_tokens is a very useful function in the Python standard library: it tokenizes Python source code, generating a stream of tokens. I used it to add syntax coloring to the HTML reporting in coverage.py.

But it has a flaw, which the docs hint at:

The line passed (the last tuple item) is the logical line; continuation lines are included.

If you’ve continued a source line with a backslash:

def my_function(arguments):
    a = very_long_function(arguments) + \
        another_really_long_function(arguments) + \
        so_that_we_have_to_wrap_the_line_with_backslashes()

then generate_tokens doesn’t ever give you a token with that backslash as the text. If you’re trying to recreate the Python source from the tokens, the backslashes will be missing.

Googling this problem turns up some muttering about how something ought to be done about it, but no solutions. It turned out not to be too hard to wrap the token generator to insert the needed backslashes:

def phys_tokens(toks):
    """Return all physical tokens, even line continuations.
    
    tokenize.generate_tokens() doesn't return a token for the backslash
    that continues lines.  This wrapper provides those tokens so that we
    can re-create a faithful representation of the original source.
    
    Returns the same values as generate_tokens()
    
    """
    last_line = None
    last_lineno = -1
    for ttype, ttext, (slin, scol), (elin, ecol), ltext in toks:
        if last_lineno != elin:
            if last_line and last_line[-2:] == "\\\n":
                if ttype != token.STRING:
                    ccol = len(last_line.split("\n")[-2]) - 1
                    yield (
                        99999, "\\\n",
                        (slin, ccol), (slin, ccol+2),
                        last_line
                        )
            last_line = ltext
        yield ttype, ttext, (slin, scol), (elin, ecol), ltext
        last_lineno = elin

Use it by passing it the generate_tokens generator:

tokgen = tokenize.generate_tokens(source_file.readline)
physgen = phys_tokens(tokgen)
for ttype, ttext, (slin, scol), (elin, ecol), ltext in physgen:
    # Blah blah, process tokens as usual

Comments

Add a comment:

Ignore this:
Leave this empty:
Name is required. Either email or web are required. Email won't be displayed and I won't spam you. Your web site won't be indexed by search engines.
Don't put anything here:
Leave this empty:
URLs auto-link and some tags are allowed: <a><b><i><p><br><pre>.