IronPython is weird

Wednesday 15 March 2017

Have you fully understood how Python 2 and Python 3 deal with bytes and Unicode? Have you watched Pragmatic Unicode (also known as the Unicode Sandwich, or unipain) forwards and backwards? You're a Unicode expert! Nothing surprises you any more.

Until you try IronPython...

It turns out that in IronPython 2.7.7, str and unicode are the same type!

C:\Users\Ned>"\Program Files\IronPython 2.7\ipy.exe"
IronPython 2.7.7 (2.7.7.0) on .NET 4.0.30319.42000 (32-bit)
Type "help", "copyright", "credits" or "license" for more information.
>>> "abc"
'abc'
>>> type("abc")
<type 'str'>
>>> u"abc"
'abc'
>>> type(u"abc")
<type 'str'>
>>> str is unicode
True
>>> str is bytes
False

String literals work kind of like they do in Python 2: \u escapes are recognized in u"" strings but not in "" strings, yet both produce the same type:

>>> "abc\u1234"
'abc\\u1234'
>>> u"abc\u1234"
u'abc\u1234'

Notice that the repr of this str/unicode type will use a u-prefix if any character is non-ASCII, but if the string is all ASCII, then the prefix is omitted.

OK, so how do we get a true byte string? I guess we could encode a unicode string? WRONG. Encoding a unicode string produces another unicode string, with the encoded byte values as code points:

>>> u"abc\u1234".encode("utf8")
u'abc\xe1\x88\xb4'
>>> type(_)
<type 'str'>
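To make "byte values as code points" concrete, here is a minimal sketch that mimics the result on an ordinary CPython. It relies on the fact that Latin-1 maps each byte value 0-255 to the code point with the same number. The helper names are hypothetical, and this only imitates the IronPython output shown above; it says nothing about how IronPython implements it.

```python
def bytes_to_codepoint_text(data):
    """Turn real bytes into text whose code points equal the byte values."""
    # Latin-1 decoding maps byte N to code point N, for N in 0..255.
    return data.decode("latin-1")


def codepoint_text_to_bytes(text):
    """Reverse the mapping: code points 0..255 back to real bytes."""
    return text.encode("latin-1")


real_bytes = u"abc\u1234".encode("utf8")       # genuine bytes on CPython
mimic = bytes_to_codepoint_text(real_bytes)    # text that looks like IronPython's result

print(mimic == u"abc\xe1\x88\xb4")
print(codepoint_text_to_bytes(mimic) == real_bytes)
```

The three code points \xe1, \x88, \xb4 are exactly the three UTF-8 bytes of U+1234, which is why the IronPython result has six characters rather than four.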

Surely we could at least read the bytes from a file with mode "rb"? WRONG.

>>> type(open("foo.py", "rb").read())
<type 'str'>
>>> type(open("foo.py", "rb").read()) is unicode
True
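The checks in the sessions above can be gathered into one small probe that runs on any Python and reports how the interpreter relates str, unicode, and bytes. This is only a sketch: the function name and the result keys are made up for illustration, and it assumes the temp-file calls behave on IronPython the way they do on CPython.

```python
import os
import tempfile


def probe_string_types():
    """Report how this interpreter relates str, unicode, and bytes."""
    try:
        text_type = unicode  # Python 2 family, including IronPython 2.7
    except NameError:
        text_type = str      # Python 3 has no separate unicode name

    # Write a tiny file, then read it back in binary mode.
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        with open(path, "wb") as f:
            f.write(b"abc")
        with open(path, "rb") as f:
            data = f.read()
    finally:
        os.remove(path)

    encoded = u"abc\u1234".encode("utf8")
    return {
        "str_is_text": str is text_type,
        "str_is_bytes": str is bytes,
        "encode_returns_text": isinstance(encoded, text_type),
        "rb_read_returns_text": isinstance(data, text_type),
    }


if __name__ == "__main__":
    for name, value in sorted(probe_string_types().items()):
        print("%s: %s" % (name, value))
```

On CPython 3, only str_is_text comes out True. Going by the sessions above, IronPython 2.7.7 would presumably report True for everything except str_is_bytes.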

On top of all this, I couldn't find docs that explain that this happens. The IronPython docs just say, "Since IronPython is a implementation of Python 2.7, any Python documentation is useful when using IronPython," and then link to the python.org documentation.

A decade-old InfoQ article, "The IronPython, Unicode, and Fragmentation Debate," discusses this decision, and correctly points out that it comes from needing to mesh well with the underlying .NET semantics. It seems very odd not to have documented it somewhere. Getting coverage.py working even minimally on IronPython was an afternoon's work of discovering each of these oddities empirically.

Also, that article quotes Guido van Rossum (from a comment on Calvin Spealman's blog):

You realize that Jython has exactly the same str==unicode issue, right? I've endorsed this approach for both versions from the start. So I don't know what you are so bent out of shape about.

I guess things have changed with Jython in the intervening ten years, because it doesn't behave that way now:

$ jython
Jython 2.7.1b3 (default:df42d5d6be04, Feb 3 2016, 03:22:46)
[Java HotSpot(TM) 64-Bit Server VM (Oracle Corporation)] on java1.8.0_31
Type "help", "copyright", "credits" or "license" for more information.
>>> 'abc'
'abc'
>>> type(_)
<type 'str'>
>>> str is unicode
False
>>> type("abc")
<type 'str'>
>>> type(u"abc")
<type 'unicode'>
>>> u"abc".encode("ascii")
'abc'
>>> u"abc"
u'abc'

If you want to support IronPython, be prepared to rethink how you deal with bytes and Unicode. I haven't run the whole coverage.py test suite on IronPython, so I don't know if other oddities are lurking there.
