A few years ago, I went looking for Python parsing tools. I spent a long time researching the various options. When I was done, I had a cheat sheet on the different alternatives. This is that cheat sheet, cleaned up a bit. It is very spotty. If you have updates to the information here, let me know. Because this is a compilation of factoids freely available on the web, it is in the public domain.

The tools are presented here in random order. I tried organizing them, but I couldn't find a scheme that seemed to help. Some points of comparison:

  • Lexical analysis technology: the two choices seem to be Python regular expressions, or deterministic finite automata. DFAs are faster (linear in the number of characters lexed).
  • Parser technology: what algorithm is used to parse? There is a spectrum of choices. See Wikipedia's Parsing Algorithms page for a start.
  • Specification style: traditional parser generators use a separate file in a new language to describe the grammar. Some of these tools do also, but many choose to put production rules in docstrings. Others use Python data structures for grammars.
  • Reliance on other tools: some of these are pure Python, some rely on existing parser generators such as Bison.
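
To make the first point concrete, regular-expression lexing in pure Python typically looks like this (a sketch built on the standard re module; the token names are invented for illustration):

```python
import re

# Named groups give each token type a name; re.VERBOSE allows the layout.
TOKEN_RE = re.compile(r"""
      (?P<NUMBER>\d+)
    | (?P<IDENT>[A-Za-z_]\w*)
    | (?P<OP>[-+*/=])
    | (?P<SKIP>\s+)
""", re.VERBOSE)

def tokenize(text):
    # Note: finditer silently skips unmatchable characters; a real lexer
    # would report them as errors.
    for m in TOKEN_RE.finditer(text):
        if m.lastgroup != "SKIP":
            yield (m.lastgroup, m.group())

print(list(tokenize("x = 12 + y")))
# [('IDENT', 'x'), ('OP', '='), ('NUMBER', '12'), ('OP', '+'), ('IDENT', 'y')]
```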

The tools

Ply
Docstrings are used to associate lexer or parser rules with actions. The lexer uses Python regular expressions.
Parses: LALR(1)
Updated: February 2011, version 3.4.
License: Lesser GPL.
Discussion: ply-hack group.
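
The docstring convention can be sketched like this (an illustration of the general mechanism only, not Ply's actual machinery; the rule names and grammar are invented):

```python
# Tools like Ply read grammar productions out of function docstrings.
# These p_* functions are invented examples of the convention.
def p_expression_plus(p):
    "expression : expression PLUS term"
    p[0] = p[1] + p[3]

def p_expression_term(p):
    "expression : term"
    p[0] = p[1]

# The generator collects each production from the function's docstring
# and uses the function body as the rule's action:
rules = {fn.__name__: fn.__doc__ for fn in (p_expression_plus, p_expression_term)}
print(rules["p_expression_plus"])
# expression : expression PLUS term
```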

pyparsing
Direct parser objects in Python, built to parallel the grammar.
Updated: June 2011, version 1.5.6.
License: MIT license.

ANTLR
Parser and lexical analyzer generator in Java. Generates parsing code in Python (as well as Java, C++, C#, Ruby, etc).
Updated: July 2011, version 3.4.
License: BSD.

pyPEG
A parsing expression grammar toolkit for Python.
Updated: June 2012, version 2.4.3.
License: GPL.

pydsl
A language workbench written in Python.
Continuous releases on GitHub.
License: GPLv3.

LEPL
A parser library for Python 2.6 and 3.0.
Updated: January 2012, version 5.0.1.
License: LGPL.

CodeTalker
Updated: July 2010, version 1.0.
License: MIT.

funcparserlib
A recursive descent parsing library based on functional combinators.
Updated: October 2009, version 0.3.4.
License: MIT.
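
The combinator style can be sketched in a few lines of plain Python (an illustration of the general approach, not funcparserlib's actual API):

```python
# A parser here is a function: input string -> (value, rest), or None on
# failure. The combinators char, seq, and many build bigger parsers from
# smaller ones.

def char(c):
    def parse(s):
        return (c, s[1:]) if s.startswith(c) else None
    return parse

def seq(p1, p2):
    def parse(s):
        r1 = p1(s)
        if r1 is None:
            return None
        v1, rest = r1
        r2 = p2(rest)
        if r2 is None:
            return None
        v2, rest = r2
        return ((v1, v2), rest)
    return parse

def many(p):
    def parse(s):
        values = []
        while (r := p(s)) is not None:
            v, s = r
            values.append(v)
        return (values, s)
    return parse

# 'a' followed by zero or more 'b's:
ab = seq(char("a"), many(char("b")))
print(ab("abbc"))
# (('a', ['b', 'b']), 'c')
```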

Simple Top-Down Parsing in Python
Not a tool exactly, but a methodology for writing top-down parsers in Python.
Updated: July 2008.
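
The core of the method (top-down operator precedence, a.k.a. Pratt parsing) fits in a dozen lines (a minimal sketch; the binding-power values are arbitrary):

```python
import re

# Binding powers: higher binds tighter. The values are arbitrary;
# only their relative order matters.
BP = {"+": 10, "*": 20, "end": 0}

def expr(tokens, rbp=0):
    left = int(tokens.pop(0))           # "nud": a number stands for itself
    while BP[tokens[0]] > rbp:          # "led": consume operators that bind tighter
        op = tokens.pop(0)
        right = expr(tokens, BP[op])
        left = left + right if op == "+" else left * right
    return left

def evaluate(s):
    return expr(re.findall(r"\d+|[+*]", s) + ["end"])

print(evaluate("2+3*4"))
# 14
```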

Pysec: Monadic Combinatoric Parsing in Python (aka Parsec in Python)
An exposition of using monads to build a Python parser.
Updated: February 2008.

picoparse
Small parser construction library for Python.
Updated: May 2010.

Aperiot
A grammar description language and parser generator for Python.
Updated: September 2007, version 0.1.2.
License: BSD license.

PyGgy
Lexes with DFA from a specification in a .pyl file. Parses GLR grammars from a specification in a .pyg file. Rules in both files have Python action code. Unlike most parser generators, the rule actions are all executed in a post-processing step. The parser isn't represented as a discrete object, but as globals in a module.
Updated: October 2004, version 0.4.1.
License: public domain.
Discussion: pyggy group.

Parsing
LR(1) parser generator as well as CFSM and GLR parser drivers.
Updated: August 2007, version 1.3.
License: MIT.

Rparse
LL(1) parser generator with AST generation.
Updated: November 2008, version 1.0.6b.
License: GPL.

SableCC
Java-based parser and lexical analyzer generator. Generates parsing code in Java, with alternative generators for other languages including Python.
License: GNU LGPL.
Updated: version 3.2

GOLD Parser
A multi-language "pseudo-open-source" parsing package.
Lexer: DFA
Parser: LALR
License: Freeware, based on zlib Open Source License.
Updated: July 2007, version 3.4.4.

Plex
Generates lexical analyzers in Python.
Lexes with DFA, specified in Python data structures. Supports multiple start states.
Updated: January 2007, version 1.1.5.
Support for Python 3: plex3.
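
Table-driven DFA lexing can be sketched like so (an illustration of the general technique, not Plex's API; the states, character classes, and token names are invented):

```python
def char_class(c):
    if c.isdigit():
        return "digit"
    if c.isalpha():
        return "alpha"
    return "other"

# transition table: (state, char class) -> next state
TRANS = {
    ("start", "digit"): "int",
    ("int", "digit"): "int",
    ("start", "alpha"): "ident",
    ("ident", "alpha"): "ident",
    ("ident", "digit"): "ident",
}
ACCEPT = {"int": "INT", "ident": "IDENT"}

def lex_one(s):
    """Run the DFA, remembering the last accepting state reached."""
    state, i, last = "start", 0, None
    while i < len(s) and (state, char_class(s[i])) in TRANS:
        state = TRANS[(state, char_class(s[i]))]
        i += 1
        if state in ACCEPT:
            last = (ACCEPT[state], s[:i], s[i:])
    return last  # (token type, lexeme, remaining text), or None

print(lex_one("abc123 rest"))
# ('IDENT', 'abc123', ' rest')
```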

yeanpypa
Constructs recursive-descent parsers directly in Python.
License: Public domain.
Updated: February 2007.

ZestyParser
Constructs recursive-descent parsers directly in Python.
License: GPL.
Updated: March 2007, version 0.7.0.

BisonGen
Generates parsers in both Python and C, from an unusual XML-based grammar description file. Very sparsely documented (there doesn't seem to be a home page for it), and seems to have no direct connection to Bison, despite the name.
Updated: April 2005, version 0.8.0b1.

DParser for Python
Uses python docstrings as input to DParser, which is implemented in C. DParser is a scannerless GLR parser.
Updated: September 2004, version 1.13.
More: Charming Python: A look at DParser for Python.

Yapps
Produces recursive-descent parsers, as a human would write. Designed to be easy to use rather than powerful or fast. Better suited for small parsing tasks like email addresses, simple configuration scripts, etc.
License: MIT.
Updated: August 2003, version 2.1.1.
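
Hand-written recursive descent, the style Yapps aims to reproduce, looks roughly like this (a hypothetical sketch, not actual Yapps output; the grammar is invented):

```python
# Grammar (invented for this sketch):
#   plist : '(' atom (',' atom)* ')'
class TinyParser:
    def __init__(self, text):
        self.text, self.pos = text, 0

    def peek(self):
        return self.text[self.pos] if self.pos < len(self.text) else ""

    def expect(self, ch):
        if self.peek() != ch:
            raise SyntaxError("expected %r at %d" % (ch, self.pos))
        self.pos += 1

    def atom(self):
        start = self.pos
        while self.peek().isalnum():
            self.pos += 1
        return self.text[start:self.pos]

    def plist(self):
        self.expect("(")
        items = [self.atom()]
        while self.peek() == ",":
            self.expect(",")
            items.append(self.atom())
        self.expect(")")
        return items

print(TinyParser("(a,b2,c)").plist())
# ['a', 'b2', 'c']
```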

PyBison
Reads docstrings in Python files to create an actual Bison grammar, runs it through Bison, then post-processes the C output to re-unite the generated parser with the Python action routines (I think).
License: GPL.
Parses: LALR(1)
Updated: June 2004, version 0.1.8.
Quirks: Doesn't yet support Windows.

Yappy
Uses Python strings in list structures to declare the grammar and lexer rules, with semantic rules implemented as Python methods.
Lexer: based on Python regular expressions.
Parses: SLR, LR(1) and LALR(1)
License: GPL
Updated: June 2004.
Quirks: Uses Python strings to declare the grammar.

Toy Parser Generator
Uses a simplistic parsing algorithm, but still seems relatively powerful.
Updated: October 2006, version 3.1.1.

kwParsing
Part of the Gadfly relational database engine.
Parses: SLR
Updated: January 2005, part of gadfly.

Martel
Martel uses a modified form of the Perl regular expression language to describe the format of a file. The definition is used to generate a parser for that format. An input file is converted into a parse tree, which is traversed in prefix order to generate SAX 2.0 events, as used in XML processing. Element names and attributes are specified in the regular expression grammar using the named group extension popularized by Python.
Updated: February 2005, part of Biopython 1.40b.

SimpleParse
Lexing and parsing in one step, but only deterministic grammars.
License: BSD.
Updated: 2006, version 2.1.0.
More: Charming Python: Parsing with the SimpleParse module.

SPARK
Uses docstrings to associate productions with actions. Unlike other tools, also includes semantic analysis and code generation phases.
Updated: stable April 2000 (0.6.1), pre-alpha May 2002.
More: Charming Python: Parsing with the Spark module.

mxTextTools
An unusual table-based parser. There is no generation step, the parser is hand-coded using primitives provided by the package. The parser is implemented in C for speed. This package underlies SimpleParse (just above).
License: eGenix Public License, similar to Python, compatible with GPL.
Updated: 2001.

FlexBisonModule
Macros to allow Flex and Bison to produce idiomatic lexers and parsers for Python. The generated lexers and parsers are in C, compiled into loadable modules.
Updated: March 2002, version 2.1.
License: Pythonesque.

Bison In A Box
Uses standard Bison to generate pure Python parsers. It actually reads the bison-generated .c file and generates Python code!
Updated: June 2001, version 0.1.0.

Berkeley Yacc
Classic Yacc, extended to generate Python code. Python support seems to be undocumented.
Updated: November 2000.

PyLR
Lexer is based on regular expressions.
Parses: LR
Updated: December 1997.

Standard Modules

The Python standard library includes a few modules for special-purpose parsing problems. These are not general-purpose parsers, but don't overlook them. If your need overlaps with their capabilities, they're perfect:

  • shlex lexes command lines using the rules common to many operating system shells.
  • ConfigParser implements a basic configuration language which provides a structure similar to what's found in Microsoft Windows INI files.
  • email provides many services, including parsing email and other RFC-822 structures.
  • parser parses Python source text.
  • cmd implements a simple command interface, prompting for and parsing out command names, then dispatching to your handler methods.
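
For instance, shlex and ConfigParser each cover their niche in a couple of lines (shown with Python 3's module name, configparser):

```python
import shlex
from configparser import ConfigParser

# shlex splits a command line using shell-style quoting rules:
print(shlex.split('convert "my file.png" -resize 50% out.png'))
# ['convert', 'my file.png', '-resize', '50%', 'out.png']

# ConfigParser reads INI-style configuration:
cp = ConfigParser()
cp.read_string("[server]\nhost = example.com\nport = 8080\n")
print(cp["server"]["port"])
# 8080
```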

Comments

Uche 8:16 AM on 2 Oct 2004

Re BisonGen.

True: docs are sparse. We developed it really as an internal tool for generating parsers needed in 4Suite, but got some interest in using it standalone, so started releasing versions of it.

Earlier versions of BisonGen used to generate a bison and flex file for second-stage processing by the GNU tools, but Jeremy, in a fit of brilliant madness, rewrote all the state table analysis and construction code from those packages in Python, so now you're right, it really has little to do with bison. Perhaps a name change is in order, but again given our shallow follow-thru w.r.t. BisonGen...

I'll at least cobble together a home page.

Michael Truog 11:44 PM on 4 Oct 2005

Thanks for maintaining your list. It really helped me find a python parser.

Arjun De 8:01 PM on 5 Oct 2005

Thanks a lot. Great list, came across pyparser thanks to this.

Suman 10:13 PM on 9 Oct 2005

Nice compilation!
Was very helpful for knowing many things i did not know previously.

Also, it would be great if you can add a rating and reviews to the tools so that it can help novices like me to select a parser and get going. It will be much better in the long run.

Ned Batchelder 11:48 AM on 10 Oct 2005

Suman, I don't have the time to rate and review each of these. I tried to objectively describe them. Trying out each of them would be a much larger undertaking. And I don't know that my criteria would be the same as yours.

Henning 4:50 PM on 14 Oct 2005

There is also a comparison on
http://wiki.python.org/moin/LanguageParsing

Daniel 'Dang' Griffith 6:28 AM on 17 Oct 2005

In the standard modules, you might consider adding the cmd module. It's convenient when you don't want to bother with writing a grammar for a simple command-line tool. It uses a naming convention to map the user's input to function names; if a match is found, the function is called, otherwise, an error function is called.
--dang

Steve 11:12 PM on 18 Dec 2005

Nice listing of resources. I hate to admit, but i don't know if i need a parser or not.. Essentially i know i spend a lot of time using regular expressions, but don't know if i can get a better deal with a parser.

The links i have so far focus on the technical aspects. So far i can not find detail on where lex / parse should and should not be used. Proly i should keep reading... Thanks for the info.

ToddB 2:18 AM on 14 Feb 2006

Would be so helpful if there was some sort of blurb about how fast these are. I am currently writing a mud, and will definitely need a parser, looked into many of these parsers, honestly speed is a very big issue. Unfortunately haven't seen any kind of benchmarks for most of these.

judge 1:38 PM on 25 May 2006

Ned -- thanks.

I appreciate the overview. This is helping jumpstart me.
I also second the request for some kind of review.

-- joe

Fabrice de fougerolles 3:04 AM on 27 Feb 2007

Hello, thanks for this page! We used your work to choose the best tool for our needs. We work on a Flight Management project (100 engineers)...

Andy Elvey 6:15 PM on 9 Jun 2007

Hi. I recently found a **great** public-domain Python-based parsing library.

It's called "yeanpypa" (YEt ANother PYthon PArsing lib) and is inspired by PyParsing and Boost::Spirit (a C++-based parsing lib that I've used a fair bit).
Indeed, IMO yeanpypa feels very much like Spirit.

The main difference to some other parsing libs is that with (say) Spirit, you specify the BNF-grammar from the top down. So, to use a nonsensical but easy-to-understand example, if you did a BNF grammar for a book (say a novel), Spirit would do it as (in pseudo-code)-

Book = one-or-more chapters
Chapter = one-or-more pages
Pages = one-or-more paragraphs
Paragraph = one-or-more-lines
Line = one-or-more words

... and so on. In yeanpypa, you would do -

Words = one-or-more letters
Line = one-or-more words
Paragraph = one-or-more-lines
Pages = one-or-more paragraphs
Chapter = one-or-more pages
Book = one-or-more chapters

Yeanpypa is great! I've tried PyParsing but just couldn't get the hang of it. Then I tried yeanpypa and (having used Spirit) I "got it" *immediately!*

Here are the URLs for yeanpypa -
http://freshmeat.net/projects/yeanpypa/

http://www.slash-me.net/dev/snippets/yeanpypa/documentation.html

gangesmaster 2:42 PM on 31 Aug 2007

'Construct' is a declarative framework for the definition of arbitrary data structures. These data structures, called 'constructs', allow both parsing and building (symmetrically).

http://construct.wikispaces.com

regulate 12:18 AM on 18 Oct 2007

Ned, pyparsing.
pyparsing!!!!!!!!!!

Matt Giuca 4:33 AM on 18 Apr 2009

Great! Thanks for the list. Looks like you've been maintaining this for *years*. Good job.

random reddit user 3:19 AM on 1 Sep 2009

Python 2.6 has a json parsing module built in.

http://docs.python.org/library/json.html

Craig McQueen 10:44 PM on 2 Dec 2009

Toy Parser Generator link is now http://christophe.delord.free.fr/tpg/index.html

Julian S. Taylor 10:28 AM on 16 Jul 2010

This was fantastically helpful. Thanks for posting this. I'm toying with pyPEG and testing others.

Claudio Canepa 9:27 PM on 10 Jul 2011

Christophe Delord, the same author of Toy Parser Generator has published another parser named SP
link: http://www.cdsoft.fr/sp/
Last release: Sunday 22 May 2011
Seems interesting, but I haven't used it yet.

PD: thanks Ned for the list, and other posters for updates

random googler 1:39 PM on 13 Jan 2012

Did you forget the standard module tokenize? It fills somewhat the same niche as shlex, but provides a more detailed tokenization. Unlike shlex, which provides a shell-style tokenization, tokenize is more attuned to the lexing needs of Python itself.

The tokenize module also provides more information about where in the text stream each token appears. That information allows it to reconstitute a modified version of the original text from a modified token stream.

Uche 8:52 PM on 5 Oct 2012

Glad this is still such a great reference. We never did get a chance to do much more work on BisonGen, but just this week I needed a full-featured lexer for Python 3 (to implement MicroXML) and I ended up porting Plex to Python 3. It's a fair whack of a change so I'm posting it in case it helps others:

https://github.com/uogbuji/plex3
