Cog 2.1 and newline detection

Saturday 24 May 2008

Since I started working full-time in Python, I haven’t needed to use my code generator Cog much, but Alexander Belchenko has. He prodded me to add one more feature to it, and has graciously and proactively kept the Russian docs up to date.

The new feature is a way to get Unix line endings in the output file, even when running on Windows. When Alexander first brought this up, my inclination was to change the code so that the line ending style of the input file would determine the style of the output file. This has a certain elegance and symmetry. It would mean that a Windows file with \r\n endings could be cog’ged on Unix, and the output file would have \r\n endings.

In Python, if you open a file in ‘rU’ mode, it is treated as a text file, and all data is presented with \n line endings, but the file object has a newlines property which is a string or tuple of all the line ending styles seen in the file. This seemed perfect for my needs. As the output file was being written, it could examine the newlines property of the input file to determine what style of endings to write. I was willing to ignore the engineer’s obsessive corner case of a file with mixed line endings, and simply say that if a \r\n had been encountered in the input, the lines would be written with \r\n; otherwise, they would get \n.
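The decision rule described above can be sketched roughly as follows for Python 3, where universal-newlines translation is the default for text-mode files and the newlines attribute is still available. The function name choose_line_ending is hypothetical and not part of Cog, and reading the whole file before consulting newlines is an assumption made to keep the sketch simple.

```python
def choose_line_ending(path):
    # Hypothetical helper, not part of Cog.
    # Returns "\r\n" if any CRLF was seen in the file, otherwise "\n".
    # newline=None (the default) enables universal-newlines translation,
    # so the text read here only ever contains "\n".
    with open(path, "r", encoding="utf-8", newline=None) as f:
        f.read()           # consume the whole file so every ending is seen
        seen = f.newlines  # None, a single string, or a tuple of strings

    if seen is None:
        # No line endings were encountered at all (for example, an empty file).
        return "\n"
    if isinstance(seen, str):
        seen = (seen,)
    # Mixed endings are deliberately not handled specially:
    # any CRLF in the input selects CRLF for the output.
    return "\r\n" if "\r\n" in seen else "\n"
```

The text produced by the generator would then be written out with whichever ending this returns.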

Alas, this didn’t quite work out. Turns out that after reading one line from a Windows file, newlines has no information in it:

>>> f = open('sample.txt', 'rU')     # open the file...
>>> f.newlines                       #  ..nothing in newlines yet
>>> f.readline()                     # read the first line...
'This is the first line\n'
>>> f.newlines                       #  ..still nothing in newlines!
>>> f.readline()                     # read the second line...
'This is the second line\n'
>>> f.newlines                       #  ..*now* something in newlines :(
'\r\n'

As a result, my code worked great, except that the first line of output always ended with a \n, while the rest of the file followed the lead of the input file.

Fixing that would have meant re-working a lot of code to buffer everything. It would have been possible, but to gain what? The code as it stands handles the case I really care about: preserving Unix line endings when processing files on Windows. To make that happen, I only had to open the output file in binary mode, since all the internal text handling uses \n endings. Handling the opposite case, preserving Windows endings on Unix, simply wasn’t important enough to warrant the effort.
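The binary-mode trick can be sketched like this in Python 3. The helper name write_with_unix_endings and the UTF-8 encoding are assumptions for illustration; this is not Cog's actual code.

```python
def write_with_unix_endings(path, text):
    # Hypothetical helper, not part of Cog.
    # `text` is assumed to use "\n" internally, as the post describes.
    # Binary mode bypasses the platform's newline translation, so each
    # "\n" is written as a single LF byte, even on Windows.
    with open(path, "wb") as f:
        f.write(text.encode("utf-8"))
```

In Python 3, opening the file in text mode with newline="\n" should have the same effect, since that setting also turns off newline translation on output.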

In any case, thanks Alexander for moving Cog forward!

Comments

Marius Gedminas 1:40 PM on 24 May 2008

Why not open the input file in binary mode and inspect the actual line ending returned by readline()?
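A minimal sketch of that suggestion, assuming it is enough to look at the first line only. The function name detect_line_ending and the default argument are hypothetical and not part of Cog.

```python
def detect_line_ending(path, default="\n"):
    # Hypothetical helper, not part of Cog.
    # In binary mode no newline translation happens, so readline()
    # returns the raw bytes, including the original line terminator.
    with open(path, "rb") as f:
        first = f.readline()

    if first.endswith(b"\r\n"):
        return "\r\n"
    if first.endswith(b"\n"):
        return "\n"
    # Empty file, or a single line with no terminator: fall back.
    return default
```

Because only the first line is inspected, a file with mixed endings is classified by whatever its first line happens to use.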

Ned Batchelder 4:26 PM on 24 May 2008

I could do that, and maybe it wouldn't be that much change to the code, but it just didn't feel worth it. If a user writes in saying it's important to them that Windows line endings be preserved when running cog on Unix, I'll probably do it. Until then, this will suffice.

Jonathan Pallant 1:43 PM on 26 Sep 2014

Just to let you know, I tripped over this exact issue while Cogging files in a Linux VM which were checked out on the Windows host using svn:eol-style native.

Ned Batchelder 3:28 PM on 26 Sep 2014

@Jonathan: I can't tell from your comment whether you are experiencing a problem, or whether the change described in this post fixed a problem for you?

Jonathan Pallant 9:33 PM on 26 Sep 2014

>>> If a user writes in saying it's important to them that Windows line endings be preserved when running cog on Unix, I'll probably do it. Until then, this will suffice.

I am indeed running Cog on Unix and wish to preserve Windows line endings. I've fudged it for now by checking for '\r\n' in the file being modified and, if found, changing the value of os.linesep and using it explicitly when adding newlines to the output. But it would be nice if cog.outl() did the right thing automatically.
