Sunday 15 June 2008
Full justification on the web is usually a bad idea.
Typographers use full justification to get an elegant-looking block of type. The straight right edge is a strong visual element on the page, and can add to the controlled overall look. But typographers care about more than just the outline of the rectangle. They care about the evenness of the type within the rectangle, something they call “color”. The goal is to get an evenly filled area, with no large changes in density.
Because full justification involves stretching word spaces, if a line has to be stretched too much, the spaces become wide enough to be noticeable white blobs on the page. The line of text is then “too loose”, and interrupts the flow of reading.
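To make "too loose" concrete, here is a minimal sketch of how a justifier pads a line. It assumes monospaced text, so widths are counted in characters; a real browser works in pixels with proportional fonts. The names justify_line and average_gap are made up for this illustration.

```python
def justify_line(words, width):
    """Pad the gaps between words so the line is exactly `width` characters."""
    gaps = len(words) - 1
    if gaps == 0:
        return words[0]  # a single word cannot be stretched
    letters = sum(len(w) for w in words)
    extra = width - letters  # total number of spaces to hand out
    base, remainder = divmod(extra, gaps)
    parts = []
    for i, word in enumerate(words[:-1]):
        parts.append(word)
        # the leftmost `remainder` gaps each get one extra space
        parts.append(" " * (base + (1 if i < remainder else 0)))
    parts.append(words[-1])
    return "".join(parts)


def average_gap(words, width):
    """Average gap width in characters; 1.0 means normal word spacing."""
    gaps = len(words) - 1
    return (width - sum(len(w) for w in words)) / gaps


print(repr(justify_line(["some", "of", "the", "top"], 20)))
print(average_gap(["some", "of", "the", "top"], 20))
```

Four short words stretched to 20 characters end up with gaps of two and three spaces, an average well over twice the normal spacing. That is the kind of line that reads as "some .. of .. the .. top".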
In traditional typography, hyphenation is used to reduce the need to make loose lines. By breaking words into smaller chunks, the lines can be filled more naturally, and they don’t have to be stretched too far.
But web browsers don’t hyphenate. As a result, paragraphs often suffer. Here are some examples from the OpenID news for February 2008:
(I’ve blurred it a bit to emphasize the color.) This paragraph is OK, with just two problem lines: the fourth (“some of the top ...”) and fourth from the bottom (“to support the community ...”). These lines are loose enough that I stumble when reading them, as if they were typed “some .. of .. the .. top ..”
But then we come to the other problem with full justification on the web:
Occasionally URLs appear in paragraphs, and these are very large “words” that completely screw up the line before them. Technical writing is especially prone to this as other non-word content appears in running text, such as function names.
I think full justification is one of those technology hold-overs: the new technology trying to mimic the old. Books and newspapers use full justification, so we try to do it on the web also. But content on the web rarely appears in a constrained rectangle. Full justification in print is appealing partly because the justified right edge of the text is a good echo of the right edge of the paper, or of the left edge of the next column in a newspaper. In a single column of text in the middle of a browser window, full justification isn’t gaining you much, and brings you pain in the form of loose lines.
Except in specialized cases, or where you know very clearly what type of content will appear, you shouldn’t use full justification on the web. The lack of hyphenation is a killer.
As it happens, there are browser-side hyphenation solutions, but they also have their drawbacks: code size and execution time.
BTW: it isn’t just the web that suffers from hyphen-less justification. Amazon’s Kindle has the same problem, something I noticed right away when I first tried one out. I’m not sure why they wouldn’t have built hyphenation into a reading device. And I’m reading a Salman Rushdie book published by Penguin which uses no hyphenation. Why would a traditionally-published book forgo the tried and true technology of good-looking pages?
Comments
But all of these techniques have to be balanced against the current browser census. Firefox 3, for example, will take some time to be adopted.
First, full justification is not merely a matter of mimicry; it's a matter of usability. Ragged edges impose cognitive load, because the eye has to re-discover the edge of every single line. With fully justified text, by contrast, the eye can simply read to the edge of the big, visually-obvious rectangle it's scanning and not have to peer blearily in between two adjacent lines of text for exactly where the current one ends.
Second, the idea that a long word like a URL would cause only the previous line to gain wide spaces is a signal that a terribly, terribly primitive paragraph-breaking algorithm is being used, and that — in my opinion — someone is being allowed to write presentation software who doesn't know the field. Very fast algorithms have existed since the 1970s that do not merely find good ways to break a paragraph into lines, but will be guaranteed to find the best way to break the entire paragraph — so that the choice of where to break the very first line can "feel" the needs of the last lines to accommodate a long URL. The entire paragraph should be spaced a bit wider to help the last line; only a very poor algorithm punishes the line right before it.
Third, soft hyphens should not be a stopgap measure, but are the optimum final solution, because hyphenation is a finicky enough beast that it cannot finally be simply left in the hands of the browser to do where it thinks it will work. Variations of language, proper names that look like normal words, and other issues make it impossible that browser-based hyphenation could ever work very well. Hyphenation should be left entirely in the hands of content producers, whose software should, under their watchful eye, produce documents with soft hyphens at every possible and appropriate point of hyphenation. It's something that needs to be in the content, not guessed at presentation time.
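As a rough illustration of the producer-side approach described in this comment, the sketch below inserts soft hyphens (U+00AD) into text from a hand-maintained table of break points. The BREAKS table and the add_soft_hyphens name are hypothetical, and the word pattern only handles plain ASCII letters.

```python
import re

SOFT_HYPHEN = "\u00ad"  # invisible unless the renderer breaks the line there

# Hypothetical break points, reviewed by the content producer.
BREAKS = {
    "hyphenation": "hy-phen-ation",
    "justification": "jus-ti-fi-ca-tion",
}


def add_soft_hyphens(text, breaks=BREAKS):
    """Insert U+00AD at every listed break point of every listed word."""
    def replace(match):
        word = match.group(0)
        marked = breaks.get(word.lower())
        if marked is None:
            return word  # unknown words are left alone
        # Slice the word as written so its capitalisation is preserved.
        pieces, pos = [], 0
        for chunk in marked.split("-"):
            pieces.append(word[pos:pos + len(chunk)])
            pos += len(chunk)
        return SOFT_HYPHEN.join(pieces)

    return re.sub(r"[A-Za-z]+", replace, text)


marked_up = add_soft_hyphens("Hyphenation helps justification.")
print(marked_up.replace(SOFT_HYPHEN, "|"))  # show the break points visibly
```

Stripping the soft hyphens back out recovers the original text, so the markup is lossless, but it has to be repeated at every occurrence of every word.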
You are also right that better line wrapping exists, but not in the browser. Again, I'm talking here about how to best use the technology we currently have. And I'm not sure even Knuth's algorithm could deal well with the affront of a 15-em URL in the middle of a paragraph!
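For readers who haven't seen the difference between first-fit wrapping and whole-paragraph wrapping, here is a simplified sketch. It is not Knuth's actual algorithm, which also models stretchable glue, penalties and hyphenation. It is a small dynamic program in the same spirit that minimises the sum of squared leftover space per line, with the last line free. It assumes monospaced text, and both function names are made up.

```python
def greedy_break(words, width):
    """First-fit wrapping: put each word on the current line if it fits."""
    lines, current = [], []
    for word in words:
        if current and len(" ".join(current + [word])) > width:
            lines.append(" ".join(current))
            current = []
        current.append(word)
    if current:
        lines.append(" ".join(current))
    return lines


def optimal_break(words, width):
    """Choose all breaks together to minimise total squared slack."""
    n = len(words)
    best = [float("inf")] * (n + 1)  # best[i]: minimal cost of setting words[i:]
    nxt = [n] * (n + 1)              # nxt[i]: where the line starting at i ends
    best[n] = 0.0
    for i in range(n - 1, -1, -1):
        length = -1
        for j in range(i + 1, n + 1):
            length += len(words[j - 1]) + 1  # width of words[i:j] with single spaces
            if length > width and j > i + 1:
                break  # too long; a lone oversized word (a URL, say) is allowed
            slack = max(width - length, 0)
            cost = 0 if j == n else slack ** 2  # the last line is not stretched
            if cost + best[j] < best[i]:
                best[i] = cost + best[j]
                nxt[i] = j
    lines, i = [], 0
    while i < n:
        lines.append(" ".join(words[i:nxt[i]]))
        i = nxt[i]
    return lines


words = "aaa bb cc ddddd".split()
print(greedy_break(words, 6))   # ['aaa bb', 'cc', 'ddddd']
print(optimal_break(words, 6))  # ['aaa', 'bb cc', 'ddddd']
```

The greedy version leaves "cc" stranded on a very loose line. The whole-paragraph version accepts a slightly short first line so that no single line takes all of the punishment.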
Which is not to say the server shouldn't dictate what words can/cannot be hyphenated. As Brandon points out, there are nuances that clients may not be able to account for with a general, all-purpose algorithm. (That said, I suspect a reasonable hyphenation algorithm + dictionary would work just fine in all but the most esoteric cases).
What is needed here is a more stylistic solution to the problem, something akin to how CSS works. Instead of embedding soft-hyphens in the content, the server should provide hyphenation rules and allow the client to determine how those rules apply to the content.
For example, in reading about how OpenOffice does hyphenation, it appears that there is a fairly standard algorithm by Franklin Liang (1982) that forms the basis of hyphenation in most free software. This, combined with the per-language hyphenation dictionaries that OpenOffice uses, would seem to be a good basis for hyphenation logic in clients, and would address 99% of the cases where hyphenation is necessary. (Note: the per-language hyphenation dictionaries are surprisingly small. ~60KB per language, fwiw.)
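A minimal sketch of how Liang-style patterns are applied, for the curious. Each pattern is a letter fragment with digits between the letters; every matching pattern votes on the gaps it covers, the highest digit at each gap wins, and an odd winner permits a break while an even one forbids it. The four patterns below are toy values invented for this example and do not come from any real dictionary. The function names and the minimum of two letters on either side of a break are likewise assumptions.

```python
import re


def parse_pattern(pattern):
    """Split a pattern such as 'hy3ph' into its letters and per-gap scores."""
    letters = re.sub(r"\d", "", pattern)
    scores = [0] * (len(letters) + 1)
    pos = 0
    for ch in pattern:
        if ch.isdigit():
            scores[pos] = int(ch)  # score for the gap before letters[pos]
        else:
            pos += 1
    return letters, scores


# Toy patterns, made up for illustration only.
PATTERNS = [parse_pattern(p) for p in ("hy3ph", "1na", "he2n", "n1at")]


def hyphenate(word, patterns=PATTERNS, left_min=2, right_min=2):
    """Return the pieces of `word` according to the given patterns."""
    padded = "." + word.lower() + "."  # dots mark the word boundaries
    scores = [0] * (len(padded) + 1)   # scores[i]: gap before padded[i]
    for letters, pat_scores in patterns:
        start = padded.find(letters)
        while start != -1:
            for k, score in enumerate(pat_scores):
                scores[start + k] = max(scores[start + k], score)
            start = padded.find(letters, start + 1)
    pieces, last = [], 0
    for i in range(left_min, len(word) - right_min + 1):
        if scores[i + 1] % 2 == 1:  # odd score: a break before word[i] is allowed
            pieces.append(word[last:i])
            last = i
    pieces.append(word[last:])
    return pieces


print("-".join(hyphenate("hyphenation")))  # hy-phen-ation
```

Here "1na" on its own would allow the break "hyphe-nation", but "he2n" outvotes it with an even score, which is how the scheme encodes exceptions without listing whole words.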
In the cases where websites wanted to further refine how hyphenation took place, they could provide one or more custom dictionaries in the form of <LINK> tags, like this:
<link rel="hyphenation" type="text/plain" href="neds-hyphenation-rules.txt">
I won't try to define the format of these hyphenation dictionaries (although the OpenOffice ".hlab" format seems an obvious choice). The main point is that this has several benefits:
1. Separates hyphenation logic from content (w00t!)
2. Even in the worst-case scenario, where the client has no default algorithm or dictionaries, this approach is more efficient. The server need only provide one rule per word that appears in a document, instead of one &shy; entity per occurrence of each word.
3. More efficient still... the clients can/will cache the dictionary files
4. And more efficient still - servers can rely on clients to perform the (potentially CPU intensive) hyphenation logic.
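Since the file format is deliberately left undefined above, the following sketch just assumes one: a plain-text file with one word per line, "-" marking the allowed breaks, and "#" starting a comment. Under that assumption, a client could consult the site's rules first and fall back to its own default hyphenator for everything else. All names here are hypothetical.

```python
def parse_rules(text):
    """Parse the assumed format: one word per line, '-' marks allowed breaks."""
    rules = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if line:
            rules[line.replace("-", "").lower()] = line.lower().split("-")
    return rules


def hyphenate_with_rules(word, site_rules, default=lambda w: [w]):
    """Use the site's rule for `word` if there is one, else the client's default."""
    pieces = site_rules.get(word.lower())
    if pieces is None:
        return default(word)
    # Re-slice the word as written so its capitalisation survives.
    out, pos = [], 0
    for piece in pieces:
        out.append(word[pos:pos + len(piece)])
        pos += len(piece)
    return out


SITE_RULES = parse_rules("""
# site-specific exceptions
batch-el-der
openid        # listed without breaks: never hyphenate
""")

print(hyphenate_with_rules("Batchelder", SITE_RULES))  # ['Batch', 'el', 'der']
print(hyphenate_with_rules("OpenID", SITE_RULES))      # ['OpenID']
```

The default argument stands in for whatever general algorithm and dictionary the client ships with; here it simply refuses to break unknown words.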