So if I have Unicode strings in Python, and I print them, they get encoded using sys.getdefaultencoding(), and if that encoding can't handle a character in my string, I get a UnicodeEncodeError. Can I set things up so that the encoding is done with 'replace' for errors rather than 'strict'? As it is, I use a function instead of print:

import sys

# Safe printing: unencodable characters become '?' instead of raising
def safeprint(msg):
    print msg.encode(sys.getdefaultencoding(), 'replace')

# blah blah
safeprint(mytrickystring)
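
For reference, here is a minimal sketch of how the two error modes differ. The 'ascii' target and the sample string are assumptions for illustration only; the real default encoding depends on the platform.

```python
# Sketch: 'strict' vs. 'replace' when encoding to a limited charset.
# 'ascii' is an assumed target encoding, chosen only for illustration.
sample = u'\u00bfHabla espa\u00f1ol?'

try:
    sample.encode('ascii')        # 'strict' is the default error mode
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True          # raised at the first unencodable character

# 'replace' substitutes '?' for each character the codec cannot encode.
replaced = sample.encode('ascii', 'replace')
```

With 'replace', every character outside the target charset is swapped for a question mark, so the call never raises for this kind of input.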

Isn't there a way to set stdout to not care or something?


Comments

Thijs van der Vossen 6:57 AM on 13 Jan 2004

You can set the default encoding to utf-8 like this:

import sys
reload(sys)
sys.setdefaultencoding('utf-8')

It's a bit of a hack, but it's very useful for testing.

Ned Batchelder 7:06 AM on 13 Jan 2004

But this doesn't give me the right output, because my command prompt window doesn't grok utf-8. I want to see strings in the proper encoding, but with a more relaxed attitude: I'm just printing stuff, do the best you can, don't panic if a character can't be encoded!

Bob 8:42 AM on 13 Jan 2004

Not clear that you can override the default behavior of print. Here's another article discussing this issue.

Thijs van der Vossen 9:07 AM on 13 Jan 2004

I understand what you want now. I'm not sure whether 'print' can be rebound.

Why don't you use a terminal emulator with utf-8 encoding support?

Graham Fawcett 9:14 AM on 13 Jan 2004

How about this:

import sys
import codecs

writer_class = codecs.getwriter(sys.getdefaultencoding())
sys.stdout = writer_class(sys.stdout, 'replace')
q = u'\u00bfHabla espa\u00f1ol?'
print q
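
To check what such a wrapper emits without touching the real stdout, the same idea can be sketched against an in-memory buffer. The buffer and the hard-coded 'ascii' encoding are assumptions for illustration; the code above uses sys.getdefaultencoding() and sys.stdout instead.

```python
import codecs
import io

# Assumption: 'ascii' stands in for a limited console encoding,
# and a BytesIO buffer stands in for sys.stdout.
buf = io.BytesIO()
writer = codecs.getwriter('ascii')(buf, 'replace')

# Unencodable characters become '?' instead of raising UnicodeEncodeError.
writer.write(u'\u00bfHabla espa\u00f1ol?\n')
output = buf.getvalue()
```

The second argument to the writer class is the error mode, which is what makes the wrapped stream forgiving.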

Doug Sauder 7:58 PM on 13 Jan 2004

It sounds like Python is doing the wrong thing by default.


By default, when encoding text, an encoder should not raise an exception. Instead, the encoder should try to continue encoding as best it can. ("Replace" is one way to continue.) Why? Because humans are much better at interpreting text than machines are. A machine doesn't know what to do with the error. A human may be able to look at text containing a few '?' characters and still make sense of it.


I think there is a general principle here. When it comes to working with text, it's better to defer to human judgment than machine judgment.

Harald Armin Massa 5:17 AM on 14 Jan 2004

Errors should never pass silently.
Unless explicitly silenced.

Paul Boddie 7:16 AM on 14 Jan 2004

Here's what I've written on the subject in the past:

http://groups.google.com/groups?selm=23891c90.0306060626.24e6646d%40posting.google.com

You could always write a function or make a class which does the right thing for you, but since there's no right thing for everyone, the "go with the flow, man" argument for some kind of magic convenience function or output mode really doesn't hold water. However, I'd like to see better locale support, but then one tends to come up against various platform breakage rather too often to make this a trivial piece of work.

Bob 9:20 AM on 14 Jan 2004

Java has similar breakage in this regard, but in a different way. Some of this may be fixed in JDK 1.5; I'm not sure. When it does a charset conversion, it uses '?' as the fallback character. That's fine most of the time, but there are cases when you want to override this behavior. For example, when converting text for HTML into a specific charset, a better fallback is to use an HTML entity code (e.g. &clubs; for ♣).
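
On the Python side, that kind of fallback can be sketched with the built-in 'xmlcharrefreplace' error handler, which emits numeric character references rather than named entities. The sample text and the 'ascii' target are assumptions for illustration.

```python
# Sketch: fall back to numeric character references instead of '?'.
# 'ascii' is an assumed target charset; the sample text is made up.
text = u'Ace of clubs: \u2663'
html_safe = text.encode('ascii', 'xmlcharrefreplace')
```

Here the club symbol (U+2663) comes out as the reference &#9827;, which a browser renders as the original character.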
