hyphenate

hyphenate.py implements Frank Liang’s hyphenation algorithm (the one used in TeX) in Python.

This module provides a single function, hyphenate_word, which takes a word as a string and returns a list of the parts that can be separated by hyphens:

>>> hyphenate_word("hyphenation")
['hy', 'phen', 'ation']
>>> hyphenate_word("supercalifragilisticexpialidocious")
['su', 'per', 'cal', 'ifrag', 'ilis', 'tic', 'ex', 'pi', 'ali', 'do', 'cious']
>>> hyphenate_word("project")
['project']
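
The returned list is easy to post-process. As a minimal sketch (mark_breaks is a hypothetical helper written for this illustration, not part of hyphenate.py), the parts can be joined with a visible hyphen, or with a Unicode soft hyphen (U+00AD) so that a renderer breaks the word only where it needs to:

```python
# Hypothetical helper, not part of hyphenate.py.
SOFT_HYPHEN = "\u00ad"  # invisible unless a line break actually happens there


def mark_breaks(parts, marker="-"):
    """Join the parts of a hyphenated word with the given break marker."""
    return marker.join(parts)


# The list below is copied from the hyphenate_word("hyphenation") example above.
parts = ["hy", "phen", "ation"]
print(mark_breaks(parts))                      # hy-phen-ation
print(repr(mark_breaks(parts, SOFT_HYPHEN)))   # 'hy\xadphen\xadation'
```

A word that comes back as a single part, such as ['project'], passes through unchanged.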

This Python code is in the public domain.

The module as provided only hyphenates English words, but if you can find TeX hyphenation patterns for another language (and can deal with the character encoding issues you’ll encounter in them), the same algorithm will work for other languages.

The Liang algorithm does not provide all possible hyphenation points. It merely tries to provide some of them, without providing any wrong ones. So the set of breaks from hyphenate.py will be a subset of the full set of break points.
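
To make the role of the patterns concrete, here is a minimal sketch of the pattern-matching step. It is not the code in hyphenate.py: it uses a plain dictionary rather than a faster lookup structure, it has no exceptions list, and SAMPLE_PATTERNS is a tiny set of TeX-style patterns picked for this illustration only. The names parse_patterns and sketch_hyphenate are made up for the example. Digits in a pattern are weights for the gaps between letters; the highest weight at each gap wins, and an odd result marks a permitted break.

```python
import re

# A handful of TeX-style patterns, chosen for illustration only.
SAMPLE_PATTERNS = "hy3ph he2n hena4 hen5at 1na n2at 1tio 2io o2n"


def parse_patterns(text):
    """Map each pattern's letters to its list of gap weights."""
    table = {}
    for pattern in text.split():
        letters = re.sub(r"[0-9]", "", pattern)
        # One weight per gap: before, between, and after the letters.
        weights = [int(d or 0) for d in re.split(r"[.a-z]", pattern)]
        table[letters] = weights
    return table


def sketch_hyphenate(word, table):
    """Split word at the gaps whose winning weight is odd."""
    work = "." + word.lower() + "."   # dots mark the word boundaries
    points = [0] * (len(work) + 1)
    # Try every substring; keep the largest weight seen at each gap.
    for start in range(len(work)):
        for end in range(start + 1, len(work) + 1):
            weights = table.get(work[start:end])
            if weights:
                for offset, weight in enumerate(weights):
                    pos = start + offset
                    points[pos] = max(points[pos], weight)
    # Suppress breaks at the word edges, after the first letter,
    # and before the last letter.
    points[1] = points[2] = points[-2] = points[-3] = 0
    pieces = [""]
    for char, weight in zip(word, points[2:]):
        pieces[-1] += char
        if weight % 2:
            pieces.append("")
    return pieces


table = parse_patterns(SAMPLE_PATTERNS)
print(sketch_hyphenate("hyphenation", table))  # ['hy', 'phen', 'ation']
```

Dropping the hen5at pattern from the sample set leaves only even weights between "phen" and "ation", so the sketch then returns ['hy', 'phenation']. Fewer patterns give fewer breaks, and the quality of the output depends on the pattern set rather than on the matching code.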

Download: hyphenate.py

Comments

[gravatar]
And there is already a django ticket to add it to the system ;-)

http://code.djangoproject.com/ticket/4821
[gravatar]
Isn't project supposed to be ['pro', 'ject'] ?
[gravatar]
"project" is one of the words Knuth explicitly added to the exceptions list as unhyphenated because the hyphenation depends on whether it is a noun or a verb. He may not be right about that, but that is why it is like that in this module.
[gravatar]
"hyphenation" should be ['hy', 'phen', 'a', 'tion']. Similarly, in supercalifragilisticexpialidocious, 'ifrag', 'ilis', and 'ali' should be split into 'i' and 'frag', 'i' and 'lis', and 'al' and 'i'.

Project is indeed ['proj', 'ect'] as a noun and ['pro', 'ject'] as a verb.
[gravatar]
I didn't make it clear here, but the Liang algorithm does not provide all possible hyphenation points. It merely tries to provide some of them, without providing any wrong ones. So the set of breaks from hyphenate.py will be a subset of the full set of break points. I've updated the description above to reflect this limitation of the algorithm.
[gravatar]
Your recent post reminded me of this article, and I thought it would be worth pointing out that there's now an interesting Google Code project for doing hyphenation in JavaScript.
[gravatar]
(@Deewlant:) There are different approaches to hyphenation, including for "hyphenation". Merriam-Webster agrees with Liang/Knuth: http://www.merriam-webster.com/dictionary/hyphenation. Liang's PhD thesis is a good read on the topic: http://www.tug.org/docs/liang/
[gravatar]
Do not confuse the algorithm with the hyphenation patterns. The algorithm can split any word, provided you supply adequate patterns, and you can always extend the pattern set with specific patterns for the words you find are hyphenated wrongly.
