Python parsing tools
Created 30 September 2004, last updated 29 December 2012
NOTE: I am no longer updating this page. Michael Bernstein has a copy at Python Parsing Tools that will be easier to keep up-to-date. The python.org wiki also has a page called LanguageParsing.
A few years ago, I went looking for Python parsing tools. I spent a long time researching the various options. When I was done, I had a cheat sheet on the different alternatives. This is that cheat sheet, cleaned up a bit. It is very spotty. Because this is a compilation of factoids freely available on the web, it is in the public domain.
The tools are presented here in random order. I tried organizing them, but I couldn’t find a scheme that seemed to help. Some points of comparison:
- Lexical analysis technology: the two choices seem to be Python regular expressions or deterministic finite automata (DFAs). DFAs are faster: lexing time is linear in the number of characters lexed.
- Parser technology: what algorithm is used to parse? There is a spectrum of choices. See Wikipedia’s Parsing Algorithms page for a start.
- Specification style: traditional parser generators use a separate file in a new language to describe the grammar. Some of these tools do too, but many choose to put production rules in docstrings. Others use Python data structures for grammars.
- Reliance on other tools: some of these are pure Python, some rely on existing parser generators such as Bison.
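To make the regular-expression style of lexing concrete, here is a minimal sketch using only the standard `re` module. The token names and patterns in `TOKEN_SPEC` are made up for illustration and are not taken from any of the tools listed here.

```python
import re

# Hypothetical token specification: (name, regular expression) pairs.
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("NAME", r"[A-Za-z_]\w*"),
    ("OP", r"[-+*/()=]"),
    ("SKIP", r"\s+"),
]

# Combine all patterns into one alternation of named groups.
MASTER_RE = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC)
)


def tokenize(text):
    """Yield (kind, value) pairs, raising ValueError on unrecognized input."""
    pos = 0
    while pos < len(text):
        match = MASTER_RE.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character {text[pos]!r} at position {pos}")
        pos = match.end()
        # Whitespace is matched so it can be consumed, but it is not emitted.
        if match.lastgroup != "SKIP":
            yield match.lastgroup, match.group()


tokens = list(tokenize("x = 3 + 42"))
```

The name of the group that matched (`match.lastgroup`) doubles as the token kind, which keeps the specification and the lexer loop separate.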
Docstrings are used to associate lexer or parser rules with actions. The lexer uses Python regular expressions.
Updated: February 2011, version 3.4.
License: Lesser GPL.
Discussion: ply-hack group.
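The docstring technique can be illustrated with a small self-contained sketch. This is not the API of any tool on this page: the `DocstringLexer` class and the `t_` naming convention below are assumptions made for the example. Each method's docstring holds the regular expression for a token, and the method body is the action run when that token is matched.

```python
import re


class DocstringLexer:
    """Collect methods named t_* and use each docstring as the token regex."""

    def __init__(self):
        self.rules = []
        # The class namespace preserves definition order, so earlier rules win.
        for name, func in vars(type(self)).items():
            if name.startswith("t_") and callable(func):
                regex = re.compile(func.__doc__)
                self.rules.append((name[2:], regex, getattr(self, name)))

    def tokenize(self, text):
        pos = 0
        while pos < len(text):
            for kind, regex, action in self.rules:
                match = regex.match(text, pos)
                if match:
                    pos = match.end()
                    value = action(match.group())
                    # An action returning None means "discard this token".
                    if value is not None:
                        yield kind, value
                    break
            else:
                raise ValueError(f"cannot lex {text[pos]!r} at position {pos}")


class CalcLexer(DocstringLexer):
    def t_NUMBER(self, text):
        r"\d+"
        return int(text)

    def t_PLUS(self, text):
        r"\+"
        return text

    def t_SPACE(self, text):
        r"\s+"
        return None


lexed = list(CalcLexer().tokenize("1 + 22"))
```

Keeping the pattern in the docstring puts each rule and its action in one place, at the cost of giving the docstring a meaning beyond documentation.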
Direct parser objects in Python, built to parallel the grammar.
Updated: June 2011, version 1.5.6.
License: MIT license.
Parser and lexical analyzer generator in Java. Generates parsing code in Python (as well as Java, C++, C#, Ruby, etc).
Updated: July 2011, version 3.4.
A parsing expression grammar toolkit for Python.
Updated: June 2012, version 2.4.3.
A language workbench written in Python.
Continuous releases on github.
A parser library for Python 2.6 and 3.0.
Updated: January 2012, version 5.0.1.
Updated: July 2010, version 1.0.
A recursive descent parsing library based on functional combinators.
Updated: October 2009, version 0.3.4.
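For readers unfamiliar with functional combinators, the following is a rough sketch of the general idea, not the API of the library described above. Every name here (`literal`, `satisfy`, `seq`, `many`, `fmap`) is hypothetical. A parser is a function that takes a token list and a position, and returns either `(value, new_position)` or `None` on failure; combinators build larger parsers out of smaller ones.

```python
def literal(expected):
    """Parser that matches one token equal to `expected`."""
    def parse(tokens, pos):
        if pos < len(tokens) and tokens[pos] == expected:
            return expected, pos + 1
        return None
    return parse


def satisfy(predicate):
    """Parser that matches one token for which `predicate` is true."""
    def parse(tokens, pos):
        if pos < len(tokens) and predicate(tokens[pos]):
            return tokens[pos], pos + 1
        return None
    return parse


def seq(*parsers):
    """Run parsers one after another, collecting their values in a list."""
    def parse(tokens, pos):
        values = []
        for parser in parsers:
            result = parser(tokens, pos)
            if result is None:
                return None
            value, pos = result
            values.append(value)
        return values, pos
    return parse


def many(parser):
    """Apply `parser` zero or more times (it must consume input on success)."""
    def parse(tokens, pos):
        values = []
        while True:
            result = parser(tokens, pos)
            if result is None:
                return values, pos
            value, pos = result
            values.append(value)
    return parse


def fmap(parser, func):
    """Transform the value produced by `parser` with `func`."""
    def parse(tokens, pos):
        result = parser(tokens, pos)
        if result is None:
            return None
        value, new_pos = result
        return func(value), new_pos
    return parse


# Grammar: total := number ("+" number)*
number = fmap(satisfy(str.isdigit), int)
plus_number = fmap(seq(literal("+"), number), lambda pair: pair[1])
total = fmap(seq(number, many(plus_number)), lambda parts: parts[0] + sum(parts[1]))

result = total(["1", "+", "2", "+", "39"], 0)
```

The grammar reads almost like its BNF form, which is the main appeal of the combinator style. The trade-off is that the parser is recursive descent, with the usual limits on left recursion.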
Simple Top-Down Parsing in Python
Not a tool exactly, but a methodology for writing top-down parsers in Python.
Updated: July 2008.
Pysec: Monadic Combinatoric Parsing in Python (aka Parsec in Python)
An exposition of using monads to build a Python parser.
Updated: February 2008.
Small parser construction library for Python.
Updated: May 2010.
A grammar description language and parser generator for Python.
Updated: September 2007, version 0.1.2.
License: BSD license.
Lexes with DFA from a specification in a .pyl file. Parses GLR grammars from a specification in a .pyg file. Rules in both files have Python action code. Unlike most parser generators, the rule actions are all executed in a post-processing step. The parser isn’t represented as a discrete object, but as globals in a module.
Updated: October 2004, version 0.4.1.
License: public domain.
Discussion: pyggy group.
LR(1) parser generator as well as CFSM and GLR parser drivers.
Updated: August 2007, version 1.3.
LL(1) parser generator with AST generation.
Updated: November 2008, version 1.0.6b.
Constructs recursive-descent parsers directly in Python.
License: Public domain.
Updated: February 2007.
Constructs recursive-descent parsers directly in Python.
Updated: March 2007, version 0.7.0.
Generates parsers in both Python and C, from an unusual XML-based grammar description file. Very sparsely documented (there doesn’t seem to be a home page for it), and seems to have no direct connection to Bison, despite the name.
Updated: April 2005, version 0.8.0b1.
DParser for Python
Uses Python docstrings as input to DParser, which is implemented in C. DParser is a scannerless GLR parser.
Updated: September 2004, version 1.13.
More: Charming Python: A look at DParser for Python.
Produces recursive-descent parsers like those a human would write by hand. Designed to be easy to use rather than powerful or fast. Better suited for small parsing tasks like email addresses, simple configuration scripts, etc.
Updated: August 2003, version 2.1.1.
Reads docstrings in Python files to create an actual Bison grammar, runs it through Bison, then post-processes the C output to re-unite the generated parser with the Python action routines (I think).
Updated: June 2004, version 0.1.8.
Quirks: Doesn’t yet support Windows.
Uses Python strings in list structures to declare the grammar and lexer rules, with semantic rules implemented as Python methods.
Lexer: based on Python regular expressions.
Parses: SLR, LR(1) and LALR(1)
Updated: June 2004.
Quirks: Uses Python strings to declare the grammar.
Toy Parser Generator
Uses a simplistic parsing algorithm, but still seems relatively powerful.
Updated: October 2006, version 3.1.1.
Martel uses a modified form of the Perl regular expression language to describe the format of a file. The definition is used to generate a parser for that format. An input file is converted into a parse tree, which is traversed in prefix order to generate SAX 2.0 events, as used in XML processing. Element names and attributes are specified in the regular expression grammar using the named group extension popularized by Python.
Updated: February 2005, part of Biopython 1.40b.
Lexing and parsing in one step, but only deterministic grammars.
Updated: 2006, version 2.1.0.
More: Charming Python: Parsing with the SimpleParse module.
Uses docstrings to associate productions with actions. Unlike other tools, also includes semantic analysis and code generation phases.
Updated: stable April 2000 (0.6.1), pre-alpha May 2002.
More: Charming Python: Parsing with the Spark module.
An unusual table-based parser. There is no generation step; the parser is hand-coded using primitives provided by the package. The parser is implemented in C for speed. This package underlies SimpleParse (just above).
License: eGenix Public License, similar to Python, compatible with GPL.
Macros to allow Flex and Bison to produce idiomatic lexers and parsers for Python. The generated lexers and parsers are in C, compiled into loadable modules.
Updated: March 2002, version 2.1.
Bison In A Box
Uses standard Bison to generate pure Python parsers. It actually reads the bison-generated .c file and generates Python code!
Updated: June 2001, version 0.1.0.
Classic Yacc, extended to generate Python code. Python support seems to be undocumented.
Updated: November 2000.
Lexer is based on regular expressions.
Updated: December 1997.
The Python standard library includes a few modules for special-purpose parsing problems. These are not general-purpose parsers, but don’t overlook them. If your need overlaps with their capabilities, they’re perfect:
- shlex lexes command lines using the rules common to many operating system shells.
- ConfigParser implements a basic configuration file parser, for files with a structure similar to that of Microsoft Windows INI files.
- email provides many services, including parsing email and other RFC-822 structures.
- parser parses Python source text.
- cmd implements a simple command interface, prompting for and parsing out command names, then dispatching to your handler methods.
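As a quick illustration of two of these modules, here is a small sketch assuming Python 3, where the ConfigParser module is named `configparser`. The command line and the configuration text are invented for the example.

```python
import configparser
import shlex

# shlex splits a command line the way a POSIX shell would, honoring quotes.
args = shlex.split('copy "my file.txt" /tmp/backup --force')

# configparser reads INI-style text; read_string avoids needing a file on disk.
config = configparser.ConfigParser()
config.read_string("""
[server]
host = example.org
port = 8080
""")

host = config["server"]["host"]
port = config.getint("server", "port")  # values are strings unless converted
```

Here `args` is a list of four strings, with the quoted file name kept as a single argument, and `port` is converted to an integer by `getint`.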
- For general information about lexing and parsing technologies, the Wikipedia articles Lexical Analyzer and Parsing Algorithms are good starting points.
- For a more in-depth review of a few of these tools, see Martin Löwis’s Towards a Standard Parser Generator.
- Another comparison of Python parsers is on the Python wiki: LanguageParsing.
- For other stuff of interest to Pythonistas, you could do a lot worse than my blog.