Tuesday 27 January 2004 — This is close to 21 years old. Be careful.
In a comment to my posting about printing Unicode from Python, Thijs van der Vossen (who has a nice blog himself) asked why I don’t use a terminal emulator with UTF-8 support.
Good question. I looked into it, and I might be using one already: the Windows prompt. It seems to have support for UTF-8, but darned if I can figure out for sure.
Windows has a command called “chcp” to change the active code page. The help isn’t clear whether this changes how the prompt interprets bytes for display, or how the built-in commands generate bytes for output.
The default on my laptop is code page 437, and when I print unusual characters, I often see the line-drawing characters instead. Somewhere I read that code page 65001 is UTF-8, and chcp will accept 65001 as a choice, but Python behaves oddly when I set it:
>>> u = u'\xab\x6b\xfc\xdf\xee\xa7\xf6\x66\x74\xbb'
>>> print u
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
LookupError: unknown encoding: cp65001
>>> print u.encode('utf-8')
..(line drawing gibberish)..
Python seemed to detect the code page automatically, but it had never heard of 65001. When I try to force the output to UTF-8, it shows the kind of line noise that indicates the display isn’t interpreting UTF-8.
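That line noise is what you would expect if the console treats every UTF-8 byte as a separate code page 437 character. Here is a minimal sketch in modern Python 3 (not the Python 2 of the session above), assuming the display really is decoding with cp437; the variable names are only for illustration.

```python
# Assumption: the console interprets each output byte as a cp437 character.
text = "\xab\x6b\xfc\xdf\xee\xa7\xf6\x66\x74\xbb"  # «küßî§öft», the string from the post

utf8_bytes = text.encode("utf-8")       # each non-ASCII character becomes two bytes
mojibake = utf8_bytes.decode("cp437")   # what a cp437 display would make of those bytes

print(len(text), "characters,", len(utf8_bytes), "UTF-8 bytes")
print(mojibake)  # a mix of box-drawing characters and stray symbols

# The damage is reversible as long as no bytes were lost along the way.
recovered = mojibake.encode("cp437").decode("utf-8")
print(recovered == text)
```

The lead bytes of two-byte UTF-8 sequences (0xC2 and 0xC3 here) land in the box-drawing range of code page 437, which is why this kind of gibberish is so full of line-drawing characters.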
Worse, just using built-in commands doesn’t seem to work properly. I created two directories, one with upper-half latin1 characters, and one with true two-byte Unicode characters. In my main prompt window, the “dir” command shows both strings properly no matter what the active code page is. Why is that?
ascii
unicode “ЌύБЇζθ₣†”
upper «küßî§öft»
With a new command window started with no customizations, I can’t figure out what it’s doing: the code page starts as 437, and it displays kind of what you’d expect (although one of the characters it is displaying isn’t in code page 437!):
ascii
unicode "????????+"
upper «küßî§öft»
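The question marks look like what happens when text is converted to a code page that lacks those characters. Here is a rough Python 3 sketch of that, assuming the names are being converted to code page 437 for display. Windows applies its own conversion rules, which may pick look-alike substitutes for some characters, so Python's "replace" handler only approximates what the prompt does. The helper name missing_from_cp437 is made up for this example.

```python
# Assumption: the directory names are converted to code page 437 for display.
names = ["ascii", "ЌύБЇζθ₣†", "«küßî§öft»"]

for name in names:
    # errors="replace" substitutes "?" for anything the code page cannot represent.
    shown = name.encode("cp437", errors="replace").decode("cp437")
    print(shown)

def missing_from_cp437(name):
    """Return the characters in name that Python's cp437 codec cannot encode."""
    missing = []
    for ch in name:
        try:
            ch.encode("cp437")
        except UnicodeEncodeError:
            missing.append(ch)
    return missing

print(missing_from_cp437("«küßî§öft»"))
```

In Python's table the only character of the second name that is missing is the section sign: the classic console font draws it at byte 0x15, which the codec treats as a control character. That may be the character the prompt is showing even though it is not formally in the code page.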
After setting the code page to 65001, everything is crap. I’d show you, but although there are line drawing characters in Unicode, it’s too much of a pain trying to use them to faithfully reproduce what I am seeing.
Somewhere there’s a great page that shows the different kinds of gibberish you encounter while doing this sort of debugging, and diagnoses what’s going wrong. Jukka Korpela’s Tutorial on Character Set Issues may be what I am thinking of, though I thought it had longer examples.
Comments
Two-byte characters are how most of UTF-16 is encoded (some characters require four bytes), but that is one encoding of Unicode, not Unicode itself. UTF-8 takes one byte per character if the character falls into the ASCII subset, and up to four bytes otherwise (the original design allowed sequences of up to six bytes, but RFC 3629 restricted it to four). UTF-32 is an encoding where each character consistently takes four bytes.
Tim Bray has a good essay on Unicode and encodings here, and he proposes a new encoding for getting entities out of XML here.
I'm not sure about Windows; I think it depends on the version of Windows you're running. On OS X the terminal handles UTF-8 just fine.
Good luck!
--Dethe
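The byte counts discussed in the comment above are easy to check. This is a small Python 3 sketch; the sample characters and the helper name encoded_sizes are chosen only for illustration, and the encodings are the little-endian variants so that no byte order mark is counted.

```python
# Sample characters: ASCII, Latin-1 range, a currency sign, and one outside the
# Basic Multilingual Plane (U+1D11E, MUSICAL SYMBOL G CLEF).
samples = ["k", "ü", "₣", "\U0001D11E"]

def encoded_sizes(ch):
    """Return the byte length of ch in UTF-8, UTF-16, and UTF-32 (no BOM)."""
    return tuple(len(ch.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le"))

for ch in samples:
    print("U+%04X" % ord(ch), encoded_sizes(ch))
```

UTF-8 grows from one to four bytes across these samples, UTF-16 jumps from two to four only for the last one, and UTF-32 stays at four throughout.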
/A Causes the output of internal commands to a pipe or file to be ANSI
/U Causes the output of internal commands to a pipe or file to be Unicode
get readable output. Set it up in Properties.
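The /U switch quoted above can be tried from Python. This is only a sketch: it assumes Windows, and it assumes that "Unicode" in that help text means UTF-16 little-endian without a byte order mark. The helper names decode_cmd_unicode and list_directory are made up for this example.

```python
import subprocess
import sys

def decode_cmd_unicode(raw):
    """Decode bytes captured from `cmd /U` (assumed to be UTF-16LE, no BOM)."""
    return raw.decode("utf-16-le").splitlines()

def list_directory(path="."):
    """Run the internal `dir /B` command with /U and return the names it prints."""
    result = subprocess.run(
        ["cmd", "/U", "/C", "dir", "/B", path],
        capture_output=True,
        check=True,
    )
    return decode_cmd_unicode(result.stdout)

# Only attempt the real call on Windows, where cmd.exe exists.
if sys.platform == "win32":
    for name in list_directory():
        print(name)
```

Because the output is captured through a pipe rather than drawn in the console window, the active code page and the console font should not affect what Python receives.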
I want to take advantage of fast printing from the DOS prompt (Command Prompt) for Unicode fonts, so I want to create an app that prints Unicode text, say Devanagari (Indian) characters, to a printer from the command prompt.
Help me.