Ned Batchelder : Blog
Displaying Unicode in Windows prompts
January 2004
Good question. I looked into it, and I might be using one already: the Windows prompt. It seems to have support for UTF-8, but darned if I can figure out for sure.
Windows has a command called "chcp" to change the active code page. The help isn't clear whether this changes how the prompt interprets bytes for display, or how the built-in commands generate bytes for output.
The default on my laptop is code page 437, and when I print unusual characters, I often see the line-drawing characters instead. Somewhere I read that code page 65001 is UTF-8, and chcp will accept 65001 as a choice, but Python behaves oddly when I set it:
>>> u = u'\xab\x6b\xfc\xdf\xee\xa7\xf6\x66\x74\xbb'
Python seemed to detect the code page automatically, but it had never heard of 65001. When I tried to force the output to UTF-8, I got the kind of line noise that indicates the display isn't interpreting UTF-8.
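One way to sidestep the unknown-code-page problem is to translate the number to a codec name by hand. The sketch below is in Python 3 syntax and assumes the failure is simply that no codec is registered under the name cp65001. `codec_for_code_page` is a hypothetical helper, not something Python or Windows provides, and falling back to ASCII is an arbitrary choice.

```python
import codecs


def codec_for_code_page(code_page):
    """Return a Python codec name for a Windows code page number.

    Hypothetical helper for illustration; not part of Python or Windows.
    """
    # Code page 65001 is Windows' number for UTF-8, so map it by hand
    # rather than relying on a codec being registered as "cp65001".
    if code_page == 65001:
        return "utf-8"
    name = "cp%d" % code_page
    try:
        codecs.lookup(name)
    except LookupError:
        # No codec registered under that name: fall back to plain ASCII.
        return "ascii"
    return name


print(codec_for_code_page(437))
print(codec_for_code_page(65001))
```

This only picks the codec for the bytes Python writes. It says nothing about how the prompt window then draws those bytes, which is the other half of the puzzle.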
Worse, even the built-in commands don't seem to behave consistently. I created two directories, one named with upper-half Latin-1 characters, and one named with true two-byte Unicode characters. In my main prompt window, the "dir" command shows both names properly no matter what the active code page is. Why is that?
In a new command window started with no customizations, I can't figure out what it's doing: the code page starts as 437, and it displays kind of what you'd expect (although one of the characters it is displaying isn't in code page 437!):
After setting the code page to 65001, everything is crap. I'd show you, but although there are line drawing characters in Unicode, it's too much of a pain trying to use them to faithfully reproduce what I am seeing.
Somewhere there's a great page that shows the different kinds of gibberish you encounter while doing this sort of debugging, and diagnoses what's going wrong. Jukka Korpela's Tutorial on Character Set Issues may be what I am thinking of, though I thought it had longer examples.
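In the same spirit, here is a small sketch that produces a few kinds of gibberish on purpose, so they are easier to recognize later. It takes the string from the Python session above, encodes it one way, and decodes the bytes another way, which is roughly what happens when a program and its display disagree about the encoding. It is in Python 3 syntax; `seen_as` is a hypothetical helper, and the four pairings are just the ones relevant to this post.

```python
# The string from the Python session above: «küßî§öft»
text = '\xab\x6b\xfc\xdf\xee\xa7\xf6\x66\x74\xbb'


def seen_as(text, written_as, displayed_as):
    """Encode text with one codec, then decode the bytes with another.

    Hypothetical helper: it imitates a program writing bytes in one
    encoding while the display interprets them in a different one.
    """
    raw = text.encode(written_as)
    # Bytes that are invalid in the display encoding become U+FFFD.
    return raw.decode(displayed_as, errors="replace")


pairings = [
    ("utf-8", "cp437"),
    ("utf-8", "latin-1"),
    ("latin-1", "cp437"),
    ("latin-1", "utf-8"),
]

for written_as, displayed_as in pairings:
    shown = seen_as(text, written_as, displayed_as)
    print("%-8s read as %-8s %s" % (written_as, displayed_as, shown))
```

The UTF-8-read-as-cp437 row is the one full of line-drawing characters, because the UTF-8 lead bytes 0xC2 and 0xC3 land on ┬ and ├ in code page 437.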