Printing Unicode from Python

Tuesday 13 January 2004

This is 21 years old. Be careful.

So if I have Unicode strings in Python, and I print them, they get encoded using sys.getdefaultencoding(), and if that encoding can’t handle a character in my string, I get a UnicodeEncodeError. Can I set things up so that the encoding is done with ‘replace’ for errors rather than ‘strict’? As it is, I use a function instead of print:

import sys

# Safe printing: can print any unicode string
def safeprint(msg):
    print msg.encode(sys.getdefaultencoding(), 'replace')

# blah blah
safeprint(mytrickystring)

Isn’t there a way to set stdout to not care or something?
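For readers on Python 3, where print is a function and strings are already Unicode, the same idea might look roughly like the sketch below. It is not the code from this post: the stream parameter, the fallback to ASCII when a stream reports no encoding, and the names tricky, strict_failed and demo are all assumptions made for illustration.

```python
import io
import sys


def safeprint(msg, stream=None):
    """Write msg to stream, replacing characters the stream cannot encode."""
    stream = sys.stdout if stream is None else stream
    # Assumption: a stream with no known encoding (e.g. io.StringIO) is treated as ASCII.
    encoding = getattr(stream, "encoding", None) or "ascii"
    # Round-trip through bytes so unencodable characters become '?'.
    text = msg.encode(encoding, "replace").decode(encoding)
    stream.write(text + "\n")


tricky = "\u00bfHabla espa\u00f1ol?"

# 'strict' is the default error handler; it raises on the first bad character.
try:
    tricky.encode("ascii")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

demo = io.StringIO()  # stands in for a console with a limited encoding
safeprint(tricky, demo)
```

On Python 3.7 and later, sys.stdout.reconfigure(errors="replace") is probably the closest thing to "set stdout to not care", since it changes the error handler on the existing stream rather than wrapping every call.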

Comments

[gravatar]
You can set the default encoding to utf-8 like this:

import sys
reload(sys)
sys.setdefaultencoding('utf-8')

It's a bit of a hack but it's very useful for testing.
[gravatar]
But this doesn't give me the right output, because my command prompt window doesn't grok utf-8. I want to see strings in the proper encoding, but with a more relaxed attitude: I'm just printing stuff, do the best you can, don't panic if a character can't be encoded!
[gravatar]
Not clear that you can override the default behavior of print. Here's another article discussing this issue.
[gravatar]
I understand what you want now. Not sure if 'print' can be rebound.

Why don't you use a terminal emulator with utf-8 encoding support?
[gravatar]
How about this:

import sys
import codecs

writer_class = codecs.getwriter(sys.getdefaultencoding())
sys.stdout = writer_class(sys.stdout, 'replace')
q = u'\u00bfHabla espa\u00f1ol?'
print q
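The same wrapping trick can be tried on Python 3, with one difference: the stream writer has to sit on top of a byte stream, since text streams there already do their own encoding. The following is only a sketch under that assumption. An in-memory io.BytesIO stands in for something like sys.stdout.buffer, and the ASCII codec is hard-coded to make the replacement visible.

```python
import codecs
import io

raw = io.BytesIO()  # stand-in for a byte stream such as sys.stdout.buffer

# getwriter returns a StreamWriter class; the second argument is the error handler.
writer = codecs.getwriter("ascii")(raw, "replace")
writer.write("\u00bfHabla espa\u00f1ol?\n")

encoded = raw.getvalue()  # both non-ASCII characters were replaced with '?'
```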
[gravatar]
It sounds like Python is doing the wrong thing by default.


By default, when encoding text, an encoder should not raise an exception. Instead, the encoder should try to continue encoding as best it can. ("Replace" is one way to continue.) Why? Because humans are much better at interpreting text than machines are. A machine doesn't know what to do with the error. A human may be able to look at text containing a few '?' characters and still make sense of it.


I think there is a general principle here. When it comes to working with text, it's better to defer to human judgment than machine judgment.

[gravatar]
Errors should never pass silently.
Unless explicitly silenced.
[gravatar]
Here's what I've written on the subject in the past:

http://groups.google.com/groups?selm=23891c90.0306060626.24e6646d%40posting.google.com

You could always write a function or make a class which does the right thing for you, but since there's no right thing for everyone, the "go with the flow, man" argument for some kind of magic convenience function or output mode really doesn't hold water. However, I'd like to see better locale support, but then one tends to come up against various platform breakage rather too often to make this a trivial piece of work.
[gravatar]
Java has similar breakage in this regard but in a different way. Some of this may be fixed in JDK 1.5, I'm not sure. When it does a charset conversion it uses '?' as the fallback character. That's fine most of the time, but there are cases when you want to override this behavior. For example, when converting text for HTML into a specific charset, a better fallback is to use an HTML entity code (e.g. &clubs;).
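For comparison on the Python side, the built-in 'xmlcharrefreplace' error handler gives that kind of fallback when encoding: characters the target charset lacks become numeric character references instead of '?'. A small sketch, with the sample text and variable names chosen purely for illustration:

```python
text = "clubs: \u2663"  # U+2663 is the black club suit character

# Numeric character reference instead of a lossy '?'.
as_entity = text.encode("ascii", "xmlcharrefreplace")

# The plain 'replace' handler, for contrast.
as_question = text.encode("ascii", "replace")
```

Note that this produces a numeric reference (&#9827;) rather than a named entity like &clubs;, which browsers render the same way.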
[gravatar]
Peter Blackledge 10:22 AM on 23 May 2017
Thanks Ned, your safeprint() solution is still useful to a newbie in 2017!
