Whitespace in Ruby and searching for code

Tuesday 27 July 2010

Armin's post about Whitespace sensitivity in Ruby piqued my interest. It points out that in Ruby, foo[42] is different than foo [42] and that foo/bar is the same as foo / bar but different than foo /bar.
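
To make that concrete, here is a minimal sketch of both cases. It assumes foo and total are methods rather than local variables (for a local variable, the space makes no difference). Both methods and their return values are made up for illustration, and the last case closes the regex with a second slash so that the line parses.

```ruby
# Hypothetical helper: with no argument it returns a Hash,
# with an argument it reports what it was given.
def foo(arg = nil)
  return "called with #{arg.inspect}" unless arg.nil?
  { 42 => "looked up 42" }
end

# Hypothetical helper: with no argument it returns 10,
# with an argument it reports what it was given.
def total(arg = nil)
  return "called with #{arg.inspect}" unless arg.nil?
  10
end

bar = 2

no_space   = foo[42]      # foo is called with no arguments, then the result is indexed
with_space = foo [42]     # foo is called with the array [42] as its argument

tight     = total/bar     # division
spaced    = total / bar   # division
regex_arg = total /bar/   # total is called with the regex /bar/ as its argument

puts no_space     # looked up 42
puts with_space   # called with [42]
puts tight        # 5
puts spaced       # 5
puts regex_arg    # called with /bar/
```

In each ambiguous case, the parser uses the space before the bracket or slash (and the lack of one after it) to decide that an argument is starting.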

So I wanted to learn more about Ruby, and looked at a bunch of tutorials, finally ending up at Mitch Fincher's Ruby Tutorial with Code Samples, which had the right breezy pace, with no "a variable is like a box for your numbers" stuff in it.

But I had originally gotten to Mitch's page from a Google search for "ruby puts gets". If you try it, you'll see that when you get to Mitch's page, a small box appears near the top, saying,

Welcome. You seem to have come here from a search engine. Your search words (ruby puts gets) are highlighted on this page for your reading pleasure.

I thought, "nice," then I thought, "that looks familiar," then I realized it was almost exactly the box that appears at the top of my pages when you visit from a search engine (try it: batchelder white house adventure). In fact, it used the same colors. I looked at his page, and it used near-verbatim copies of my three JavaScript files, though I consolidated mine into one a few years ago.

I was amused, and wondered where else the code is being used. But I can't find out by searching: the search engines are smart enough not to index comments in JavaScript files, or the names of JavaScript files referenced in HTML pages, unless there's some tricky syntax I don't know about.

PS, about whitespace sensitivity: I've decided the phrase means that a programming language needs tokens consisting only of whitespace in order to be parsed properly. By that definition, Python and Ruby are whitespace-sensitive, and C is not.

Comments

Blake Winton 9:20 AM on 28 Jul 2010

C isn't? Does that mean that "*a / *b; // */" would be the same as "*a/*b; // */" in C?

Later,
Blake.

Ned Batchelder 9:45 AM on 28 Jul 2010

My definition was that a language is whitespace-sensitive if, after lexical analysis, there are tokens consisting entirely of whitespace characters. The fact that whitespace is needed to separate tokens isn't interesting; after all, "int i = 9" is different than "inti=9" too.

In your example, "/ *" is tokenized as "/", "*", and "/*" is the start of a comment, but there are no tokens that are purely whitespace.

Blake Winton 10:05 AM on 28 Jul 2010

Okay, I think I get it now. Thanks for the clarification!

Later,
Blake.

Neil Smithline 10:59 PM on 24 Nov 2012

OK, I'm a bit late -- 2 years late, but I think this is an interesting way of defining a whitespace sensitive v. whitespace insensitive language. That said, Python still seems a bit funky to me.

Extending your idea, leading whitespace in Python cannot be collapsed into a single whitespace token as different indentations affect Python semantics.

To top it off, whitespace within a line (in all cases, I think?) and whitespace at the end of a line are both ignored.

Mitch Fincher 2:00 AM on 26 Jan 2014

Hi Ned,
Thanks for writing the hilite javascript! I saw you use it on one of your pages. A little reading of your code showed me how you did it. I was thunderstruck at the time. I had always modified code on the server-side in perl and pushed it down to the browser. Your code was doing it on the ... gulp ... client side. It was a revelation at the time. I have been using the code for years now.
Cheers, Mitch
