BOM synchronicity

Saturday 11 June 2005 — This is 21 years old. Be careful.

It’s funny how things happen. I was at the Boston Python Meetup the other night, and one of the things we got talking about was the intricacies of Unicode, including the Byte Order Mark (BOM). The BOM is a “character” in the Unicode standard, U+FEFF. It doesn’t render as anything (its formal name is Zero Width No-Break Space, or ZWNBSP). Its purpose in life is to be a tell-tale indicator of the endianness of a UTF-16 text file.

If you read the first two bytes of a Unicode file, and they are 0xFF 0xFE, then you know that it is UTF-16, little-endian (low-order byte first). If they are 0xFE 0xFF, then you know it is UTF-16, big-endian.
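
As a rough illustration of that two-byte check, here is a minimal Python sketch. The function name `sniff_utf16_bom` is made up for this example, and it deliberately ignores UTF-32 (whose little-endian BOM also begins with 0xFF 0xFE) and UTF-8, so treat it as a starting point rather than a complete detector.

```python
import codecs
from typing import Optional


def sniff_utf16_bom(data: bytes) -> Optional[str]:
    """Return the UTF-16 byte order implied by a leading BOM, or None.

    Hypothetical helper for illustration only; it does not try to
    distinguish UTF-16 from UTF-32.
    """
    if data.startswith(codecs.BOM_UTF16_LE):  # bytes 0xFF 0xFE
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):  # bytes 0xFE 0xFF
        return "utf-16-be"
    return None  # no BOM: the byte order has to be learned some other way


if __name__ == "__main__":
    sample = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")
    print(sniff_utf16_bom(sample))
```

The `codecs` constants are used instead of hand-typed byte strings so the two byte orders can't be accidentally swapped.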

So we were talking about this Thursday night, and about how funky things can happen, and how to diagnose them correctly you have to grok all this BOM stuff (not to mention other things like UTF-8, UTF-16, and so on).

OK, so the next day, I’m setting up a Remote Desktop Connection on my Windows box, and the video settings offer me 1280×1024 or full-screen. But I want 1400×1050. I figure I’ll create the .rdp file at 1280×1024, then open it up, find the display resolution and change it. I open the .rdp file, and what do you know, it’s text! I edit the text, save the file, double-click the .rdp file, and nothing happens. Huh?

Then the previous night’s conversation comes back to me. Sure enough, when I hexdump the edited .rdp file and an unedited one, the original has a BOM, and the edited one does not. Both are legit UTF-16, both are little-endian, but the one with the BOM works, and the one without does not. Now that is not proper Unicode support, but at least I understand what went wrong. I open the file again in TextPad, use the document properties to instruct it to write the BOM, save the file again, and everything works gloriously.
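
For anyone without an editor that can write the BOM, the same repair can be sketched in a few lines of Python. This assumes, as in the story above, that the file is already little-endian UTF-16 and is only missing its BOM. The names `ensure_utf16le_bom` and `fix_file` are invented for this example, and it is not a general-purpose encoding fixer.

```python
import codecs


def ensure_utf16le_bom(data: bytes) -> bytes:
    """Prepend the little-endian UTF-16 BOM unless it is already there.

    Assumes the payload really is UTF-16-LE; nothing here verifies that.
    """
    if data.startswith(codecs.BOM_UTF16_LE):
        return data
    return codecs.BOM_UTF16_LE + data


def fix_file(path: str) -> None:
    """Rewrite the file at `path` in place so that it starts with a BOM."""
    # Binary mode throughout, so no newline or encoding translation happens.
    with open(path, "rb") as f:
        data = f.read()
    with open(path, "wb") as f:
        f.write(ensure_utf16le_bom(data))
```

Calling `fix_file` on the edited file (whatever it happens to be named) would then be roughly equivalent to the TextPad step, and running it twice is harmless because the BOM is only added when missing.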

Ah, Unicode. It makes life so much simpler, doesn’t it?

Comments

wm_eddie 12:56 PM on 12 Jun 2005

UTF-16/UTF-32 need to DIE! They need to die horrible fiery deaths.

uriel 10:11 AM on 13 Jun 2005

The BOM is one of the worst ideas any standards committee has ever come up with.
UTF-16 is already bad enough without having it allow (optionally!) for
different endianness.

The worst is the people who are so mentally perturbed by having to use
UTF-16 (and programs that use it) that they decided to add BOMs to UTF-8, which
has no need for it, and which breaks every sane UTF-8 app out there.

Will Rickards 3:40 PM on 13 Jun 2005

I noticed the BOM in udl files when I was writing a wrapper to create them. I had no idea what it was but without it the file was just not recognized as a valid universal data link file. I hadn't noticed it at first because my usual editor, vim, wasn't showing anything to me. I had to look at the file converted to hex to see it.

Brian K. White 3:49 AM on 11 Feb 2006

Just discovered this myself while falsely assuming I could replace the nice web page login buttons to lots of my customers' boxes, which are merely links to .vnc files, with equivalent .rdp files (after setting up the new port-forwarding in the routers of course).
Bzzt! nope.

Anyone know if it's possible to put a .rdp file on a web page and have it work, without a lot of manual gyrations on the client? (For vnc, I just have them install tightvnc, whose installer makes the file association for .vnc files for you.) I'm assuming, if it's possible at all, that you need something in mime.types on the server, but what?

The vnc files work so nice. I even have a cgi that overwrites .vnc files from template blah.vnc.src files where it inserts the correct IP address every time a wget command in windows scheduler on the customer's box hits the cgi. It's reliable, lightweight, doesn't need dyndns.org, beautiful.

Yay for plain text config files! Stupid me I THOUGHT I could do the same thing with rdp files.
So often the people making up new stuff just don't get it.
So much basic sanity that was all figured out long ago just gets forgotten every year with every new crop of IT graduates.
Or maybe it's just M$ that never gets it...

Brian K. White 4:02 AM on 11 Feb 2006

Woo! Awesome!
if you put this in mime.types:
application/mstsc rdp

then you can put rdp files on the web page and they work.

Better yet, you can convert the utf8/16 to ascii and remove lines you don't want hard coded like password and screen size, and it still works!

There are unix utils to convert from utf8/16 to ascii, iconv and recode to name two. But my SCO OpenServer version of iconv didn't have utf* conversion files, and recode... is weird, and rather than spend a half hour deciphering its man page (and I LIKE man pages usually) I just loaded the original file in EditPadLite http://www.editpadpro.com/ on windows, and it has a convert/unicode/utf8->ansi option that worked. On unix you might have to take care to preserve the dos CRLF line endings.

So, my cgi that reads plain ascii template files, runs sed on them, and spits out plain ascii files works for rdp too.
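
Where iconv or recode aren't cooperating, the UTF-16 to ASCII step described above could also be done with a short Python sketch like the one below. It is only an illustration: `utf16_to_ascii` and `convert_file` are invented names, it handles UTF-16 input only, and it assumes the text contains nothing outside ASCII (anything else raises `UnicodeEncodeError` rather than being silently mangled). Working on raw bytes means the CRLF line endings pass through untouched.

```python
def utf16_to_ascii(data: bytes) -> bytes:
    """Decode UTF-16 bytes and re-encode them as plain ASCII.

    Python's "utf-16" codec honors and strips a leading BOM; if there is
    no BOM it falls back to the machine's native byte order, which is an
    assumption worth checking for files that come from elsewhere.
    """
    text = data.decode("utf-16")
    return text.encode("ascii")  # raises UnicodeEncodeError on non-ASCII


def convert_file(src: str, dst: str) -> None:
    """Read `src` as UTF-16 and write an ASCII copy to `dst`."""
    # Binary mode on both ends, so "\r\n" is never rewritten to "\n".
    with open(src, "rb") as f:
        data = f.read()
    with open(dst, "wb") as f:
        f.write(utf16_to_ascii(data))
```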
