hyphenate.py implements Frank Liang's hyphenation algorithm (the one used in TeX) in Python.

This module provides a single function to hyphenate words. hyphenate_word takes a string (the word), and returns a list of parts that can be separated by hyphens:

>>> hyphenate_word("hyphenation")
['hy', 'phen', 'ation']
>>> hyphenate_word("supercalifragilisticexpialidocious")
['su', 'per', 'cal', 'ifrag', 'ilis', 'tic', 'ex', 'pi', 'ali', 'do', 'cious']
>>> hyphenate_word("project")
['project']
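
Since the result is a plain list, displaying the break points is just a join. A small sketch, assuming you already have the list returned by hyphenate_word; the helper name mark_breaks and the SOFT_HYPHEN constant are made up for illustration and are not part of the module:

```python
# Hypothetical helper, not part of hyphenate.py.
SOFT_HYPHEN = "\u00ad"  # Unicode soft hyphen, invisible unless a line breaks there


def mark_breaks(parts, mark="-"):
    """Join the parts of a hyphenated word with the given break marker."""
    return mark.join(parts)


# The list below is copied from the doctest output above.
parts = ['hy', 'phen', 'ation']
print(mark_breaks(parts))  # hy-phen-ation

# For text where breaks should stay invisible until needed:
invisible = mark_breaks(parts, SOFT_HYPHEN)
```

A word that comes back as a single part, such as ['project'], passes through unchanged.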

This Python code is in the public domain.

The module as provided only hyphenates English words, but if you can find TeX hyphenation patterns for another language (and can deal with the character encoding issues you'll encounter in them), the same algorithm will work for other languages.
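
As a rough starting point for that, TeX pattern files such as hyphen.tex keep the patterns in a \patterns{...} block and the exceptions in a \hyphenation{...} block. The sketch below pulls the entries out of such a file, assuming the file has already been decoded into a Python string. extract_block is a hypothetical helper and the sample text is made up; it does not handle TeX character escapes or macros, which is where the encoding trouble tends to show up.

```python
import re


def extract_block(tex_source, command):
    """Return the whitespace-separated entries of a TeX block.

    `command` is the block name without its backslash, for example
    "patterns" or "hyphenation". Hypothetical helper for illustration.
    """
    # Drop TeX comments: everything from % to the end of the line.
    without_comments = re.sub(r"%.*", "", tex_source)
    match = re.search(r"\\" + re.escape(command) + r"\s*\{([^}]*)\}",
                      without_comments)
    return match.group(1).split() if match else []


# Made-up sample in the general shape of a TeX pattern file.
SAMPLE = r"""
% toy pattern file
\patterns{ % comment after the brace
.ach4
hy3ph
}
\hyphenation{
as-so-ciate
project
}
"""

patterns = extract_block(SAMPLE, "patterns")
exceptions = extract_block(SAMPLE, "hyphenation")
```

Here patterns holds the two pattern strings and exceptions holds the two exception words, ready to be fed to whatever builds the lookup structure.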

The Liang algorithm does not provide all possible hyphenation points. It merely tries to provide some of them, without providing any wrong ones. So the set of breaks from hyphenate.py will be a subset of the full set of break points.
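
To make that concrete, here is a minimal sketch of the core idea, not the code in hyphenate.py. Each pattern is a run of letters with digits in the gaps; every pattern that occurs in the word (padded with "." at both ends) contributes its digits, the highest digit at each gap wins, and an odd result marks a permitted break. The names parse_pattern, hyphenate_sketch, min_left and min_right are assumptions for illustration, and the nine patterns are a toy set picked for one word. Real pattern lists hold thousands of entries, and an exceptions list overrides them.

```python
import re


def parse_pattern(pattern):
    """Split a pattern such as 'hy3ph' into its letters and gap weights."""
    letters = re.sub(r"[0-9]", "", pattern)
    # One weight per gap, including the gaps before and after the letters.
    weights = [int(d or 0) for d in re.split(r"[.a-z]", pattern)]
    return letters, weights


def hyphenate_sketch(word, patterns, min_left=2, min_right=2):
    """Return the parts of `word` using only the given patterns."""
    parsed = dict(parse_pattern(p) for p in patterns)
    work = "." + word.lower() + "."
    # points[i] is the weight of the gap just before work[i].
    points = [0] * (len(work) + 1)
    for start in range(len(work)):
        for end in range(start + 1, len(work) + 1):
            weights = parsed.get(work[start:end])
            if weights is not None:
                for offset, weight in enumerate(weights):
                    index = start + offset
                    points[index] = max(points[index], weight)
    # An odd weight allows a break; keep a minimum of letters on each side.
    pieces, last = [], 0
    for k in range(min_left, len(word) - min_right + 1):
        if points[k + 1] % 2:  # gap just before word[k]
            pieces.append(word[last:k])
            last = k
    pieces.append(word[last:])
    return pieces


# Toy pattern set, just enough for one word.
TOY_PATTERNS = ["hy3ph", "he2n", "hena4", "hen5at", "1na",
                "n2at", "1tio", "2io", "o2n"]

print(hyphenate_sketch("hyphenation", TOY_PATTERNS))
# ['hy', 'phen', 'ation']
```

Even digits suppress a break and odd digits allow one, so "1na" alone would permit a break before "na", but the 2 from "he2n" at the same gap outranks it. That competition is how the patterns avoid wrong breaks at the cost of missing some valid ones.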

Download: hyphenate.py

See also

  • Hyphenation algorithm at Wikipedia, with links to other implementations of this same algorithm.
  • My blog, where other topics of interest to Python hyphenators are likely to appear.

Comments

Doug Napoleone 1:12 AM on 10 Jul 2007

And there is already a Django ticket to add it to the system ;-)

http://code.djangoproject.com/ticket/4821

Jesper 11:46 AM on 10 Jul 2007

Isn't project supposed to be ['pro', 'ject'] ?

Ned Batchelder 1:32 PM on 10 Jul 2007

"project" is one of the words Knuth explicitly added to the exceptions list as unhyphenated because the hyphenation depends on whether it is a noun or a verb. He may not be right about that, but that is why it is like that in this module.

Deewiant 6:40 AM on 12 Jul 2007

"hyphenation" should be ['hy', 'phen', 'a', 'tion']. Similarly, in supercalifragilisticexpialidocious, 'ifrag', 'ilis', and 'ali' should be split into 'i' and 'frag', 'i' and 'lis', and 'al' and 'i'.

Project is indeed ['proj', 'ect'] as a noun and ['pro', 'ject'] as a verb.

Ned Batchelder 7:00 AM on 12 Jul 2007

I didn't make it clear here, but the Liang algorithm does not provide all possible hyphenation points. It merely tries to provide some of them, without providing any wrong ones. So the set of breaks from hyphenate.py will be a subset of the full set of break points. I've updated the description above to reflect this limitation of the algorithm.

Robert K 6:34 AM on 18 May 2010

Your recent post reminded me of this article, and I thought it would be worth pointing out that there's now an interesting Google Code project for doing hyphenation in JavaScript.

Daniel Warmuth 1:17 AM on 12 Sep 2011

(@Deewiant:) There are different approaches to hyphenation, including for "hyphenation" - Merriam-Webster agrees with Liang/Knuth: http://www.merriam-webster.com/dictionary/hyphenation. Liang's PhD thesis is a good read on the topic: http://www.tug.org/docs/liang/
