
Homer and Bush in CSS

Tuesday 29 April 2008

Román Cortés has done an amazing thing. He's made portraits of Homer Simpson and George Bush. Here's Homer:

o o o o ( O O O \ L ( O O O O O \ L ( O | | \ \ | | \ \ \ \ ( ( 8 o o o ( ( 8 o o o o ) ) b o O o o o o o o ) b o O o o o o o o o o o / / / • • • • • _ _ _ • • • C C O ( -

and here's Bush:

o o o o o o o l o ´ ´ ` ) ) ( · o ` - - - · · o o - / 0 / - ( o o ` ` ( ( o \ ´ o o o o ` 0 ( \ - ` - · · o ( 0 0 ~ o o o o 0 0 0 0 ( ` ( o o o o o o o o - ‘ - 0 0 o o o o o o · • O ´ o o ` / · ( · ´ ) ` \ · · o o 0 0 • ` ` • / / - - - - o 0 o o o o o o o o - - - - • • • o o - • • • • • • ´ - - ( \ ( o o • • • • • ) • • • • • • • • • • • / / • ` • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • ` ) ´ ` ` ` ` • • ´ ´ ´ ´ l `

Wait, those won't look right without the proper CSS styling. Here they are as they are meant to be seen:

Homer Simpson George Bush

Yes, those really are characters styled to position them correctly to make the images. Go look at the HTML source on the original HTML pages to see for yourself.

This is oddly reminiscent of a similar Simpsons-themed artwork: Google groups ascii art: Bart Simpson.


Tuesday 29 April 2008

This has been around for a long time, but I'd never heard of it: ReactOS is an open-source re-implementation of Windows. I guess it's possible to be fanatically devoted to both Windows and open source. As mammoth a task as this sounds, it seems they are making progress. Although they've been at it for about ten years, they have screenshots of working code, and an active subversion repository (they're working on DirectX support now).

I wonder what the future will hold for ReactOS. Microsoft will continue to build Windows, widening the gap between what Windows and ReactOS are, though if the reaction to Vista is any indication, perhaps XP and its ReactOS clone will be considered the golden age of Windows. At the same time, anti-Microsoft sentiment will continue to build among the open source community, either in the pro-Linux or pro-Apple flavor.

Non-transitive dice

Thursday 24 April 2008

I'm still trying to wrap my head around this. Non-transitive dice are four dice and a game to go with them, where each die beats the next in line, and the last beats the first. Each can be shown to be better than the next, but somehow it keeps going in a cycle, never reaching an all-around best die. Kind of like a quantitative rock-paper-scissors, reminiscent of Escher's Ascending and Descending.

Wikipedia has more on these dice.
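To see the cycle concretely, here is a minimal sketch that enumerates every pair of faces and computes the exact win probabilities. The face values are those of Efron's dice, a well-known non-transitive set; the post doesn't say which set it means, so treat that choice as an assumption. The names DICE, win_probability and cycle_probabilities are made up for the illustration.

```python
from fractions import Fraction
from itertools import product

# Assumed example set (Efron's dice); the post does not list face values.
DICE = {
    "A": (4, 4, 4, 4, 0, 0),
    "B": (3, 3, 3, 3, 3, 3),
    "C": (6, 6, 2, 2, 2, 2),
    "D": (5, 5, 5, 1, 1, 1),
}

def win_probability(first, second):
    """Exact probability that one roll of `first` beats one roll of `second`."""
    wins = sum(1 for a, b in product(first, second) if a > b)
    return Fraction(wins, len(first) * len(second))

def cycle_probabilities(dice, order="ABCD"):
    """Pair each die with the next one in line, wrapping back to the start."""
    results = []
    for i, name in enumerate(order):
        nxt = order[(i + 1) % len(order)]
        results.append((name, nxt, win_probability(dice[name], dice[nxt])))
    return results

for name, nxt, p in cycle_probabilities(DICE):
    print("%s beats %s with probability %s" % (name, nxt, p))
```

With these faces, every die beats the next with probability 2/3, including D against A, which is what closes the loop.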

Separating sentences

Saturday 19 April 2008

One of the things I needed for my new home page design was a way to split a chunk of HTML to get just the text of the first sentence, which I use for the blog posts on the front page.

The preliminaries: these are Django filters, but mostly they're just string functions, wrapped with Django decorators to make them useful in Django templates.

Here are two helpers:

def inner_html(value):
    """ Strip off the outer tag of the HTML passed in.
    """
    if value.startswith('<'):
        value = value.split('>', 1)[1].rsplit('<', 1)[0]
    return value

def first_par(value):
    """ Take just the first paragraph of the HTML passed in.
    """
    return value.split("</p>")[0] + "</p>"

These functions are pretty simple, meant to operate on simple inputs. For example, first_par assumes that the opening tag of the HTML is <p>.
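A quick way to see both the intended behavior and the stated assumption is to run the helpers on a couple of inputs. The sample strings below are made up for illustration, and the helpers are repeated so the snippet runs on its own.

```python
def inner_html(value):
    """Strip off the outer tag of the HTML passed in."""
    if value.startswith('<'):
        value = value.split('>', 1)[1].rsplit('<', 1)[0]
    return value

def first_par(value):
    """Take just the first paragraph of the HTML passed in."""
    return value.split("</p>")[0] + "</p>"

# Made-up sample input: two simple paragraphs.
sample = '<p>First <b>paragraph</b>.</p><p>Second paragraph.</p>'

print(first_par(sample))              # <p>First <b>paragraph</b>.</p>
print(inner_html(first_par(sample)))  # First <b>paragraph</b>.

# When the HTML doesn't open with <p>, the simple string splitting
# leaves a dangling tag behind.
wrapped = '<div><p>Nested.</p></div>'
print(inner_html(first_par(wrapped)))  # <p>Nested.
```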

Splitting sentences is fairly tricky. I tried searching for a Python snippet, which I didn't find. I tried thinking about regexes that could do it, but the rules are too complicated. In the end, the code structure I could understand was to break the text into words, and then add words one at a time to a potential sentence, checking it for sentence-hood.

Here are the rules I came up with for something being a sentence:

  • The end of the sentence must be punctuation (.!?), possibly with closing parens and/or double-quote after it.
  • The next chunk of text has to start with an upper-case letter or number, possibly with an opening paren and/or double-quote preceding it.
  • The sentence can't end with "Mr." or titles like it, or an initial. This is to keep the previous two rules from splitting sentences like "Hello Mr. John Q. Public!" incorrectly in the middle.
  • The sentence needs to have balanced parens and double-quotes. This ensures that sentence breaks won't be identified in quoted material (or parenthetical asides).

These rules seem to work well for picking out the first sentence from each of my 1800-odd blog posts. Here's the code:

import re

def first_sentence(value):
    """ Take just the first sentence of the HTML passed in.
    """
    value = inner_html(first_par(value))
    words = value.split()
    # Collect words until the result is a sentence.
    sentence = ""
    while words:
        if sentence:
            sentence += " "
        sentence += words.pop(0)
        if not re.search(r'[.?!][)"]*$', sentence):
            # End of sentence doesn't end with punctuation.
            continue
        if words and not re.search(r'^[("]*[A-Z0-9]', words[0]):
            # Next sentence has to start with upper case.
            continue
        if re.search(r'(Mr\.|Mrs\.|Ms\.|Dr\.| [A-Z]\.)$', sentence):
            # If the "sentence" ends with a title or initial, then it probably
            # isn't the end of the sentence.
            continue
        if sentence.count('(') != sentence.count(')'):
            # A sentence has to have balanced parens.
            continue
        if sentence.count('"') % 2:
            # A sentence has to have an even number of quotes.
            continue
        # Every rule passed: this is the first sentence.
        break
    return sentence

This is coded not for speed but for being able to see what it does and add new clauses as I find broken sentences. The candidate sentence starts out empty. Words are appended to it one at a time, and the sentence checked against the rules. If any rule is violated, we continue to the next word. If all the rules pass, we break out of the loop and return the found sentence.
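One possible way to make "adding new clauses" even more mechanical is to keep each rule as a small function in a list. This is only a sketch of that idea, not the code the site uses: the function names and the REJECTIONS list are invented here, and it works on plain text rather than HTML. The regexes are the same ones as in the rules above.

```python
import re

# Each check returns True when the candidate is NOT yet a complete sentence.
# `rest` is the list of words that have not been consumed yet.
def ends_without_punctuation(sentence, rest):
    return not re.search(r'[.?!][)"]*$', sentence)

def next_word_not_capitalized(sentence, rest):
    return bool(rest) and not re.search(r'^[("]*[A-Z0-9]', rest[0])

def ends_with_title_or_initial(sentence, rest):
    return bool(re.search(r'(Mr\.|Mrs\.|Ms\.|Dr\.| [A-Z]\.)$', sentence))

def unbalanced_parens(sentence, rest):
    return sentence.count('(') != sentence.count(')')

def odd_double_quotes(sentence, rest):
    return sentence.count('"') % 2 == 1

# Hypothetical registry: a new clause is one more entry in this list.
REJECTIONS = [
    ends_without_punctuation,
    next_word_not_capitalized,
    ends_with_title_or_initial,
    unbalanced_parens,
    odd_double_quotes,
]

def first_sentence_text(text):
    """Return the first sentence of plain text, using the rules above."""
    words = text.split()
    sentence = ""
    while words:
        if sentence:
            sentence += " "
        sentence += words.pop(0)
        if not any(reject(sentence, words) for reject in REJECTIONS):
            break
    return sentence

print(first_sentence_text('Hello Mr. John Q. Public! This is next.'))
print(first_sentence_text('He said "Stop. Now." Then left.'))
```

The first call stops after "Public!" rather than at "Mr." or "Q.", and the second waits for the closing quote before ending the sentence.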

I know this code isn't perfect. Here are some things it doesn't do well:

  • Sentences with single-quote quotes, because just counting them isn't sufficient. Apostrophes and single-quotes make it so that the number isn't always even.
  • Text with curly quotes.
  • Sesame Street sentences: "This blog brought to you by the letter B."
  • Sentences about punctuation, or with code in them.

Actually, there are lots of cases that will not be handled well. Word-play enthusiasts I'm sure will enjoy coming up with examples.
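The quote problems are easy to reproduce just by counting characters. The sample strings here are invented for the demonstration.

```python
# Made-up samples, one per quoting style.
straight = 'She said "go." Then left.'
single = "She said 'don't go.' Then left."
curly = u'She said \u201cgo.\u201d Then left.'

# Straight double quotes pair up, so an even count means "balanced".
print(straight.count('"'))  # 2

# Two single quotes plus one apostrophe: the count is odd even though
# the quotation itself is properly closed.
print(single.count("'"))    # 3

# Curly quotes are different characters, so the double-quote count
# never sees them at all.
print(curly.count('"'))     # 0
```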


Thursday 17 April 2008

Command-Shift-3 is a website for showcasing web page designs and pitting them against each other in HotOrNot-style face-offs. The design and tone of the site are fun, and they provide interesting ways to surf around: leaders right now, all-time best, worst ever, by tag (orange), and battle by tag (cat).

In addition to the snarky fun of voting, you can see some good stuff roll by. From it, I found MisterPresident, a site designed by Khoi Vinh for his dog.

Playing with this site last week helped energize me to do my own redesign, which as of this writing has managed one win, one loss, and one draw in competition.

PS: for the non-Mac users out there: command-shift-3 is the OSX shortcut for capturing a screenshot.

New home page

Tuesday 15 April 2008

For the last six years, the home page of this site has been pretty plain. Mostly it was my name, the star logo, and a few links off to the rest of the site. Here's how it started in 2002.

I finally got around to doing something more with it. Check out the new design. It is still just a jumping-off point for the rest of the site, but improved. The design has more energy, and there's more content there to get you started: one-sentence excerpts from the last four blog posts, a quick tag summary for the blog, three featured code projects, three featured text articles, a bit about me, and a search bar.

My design aesthetic has always been strongly typographic, with little or no color. In designing this site, I've looked at the spectrum of other similarly designed sites. At one end are the minimalists, currently epitomised by Ryan Tomayko. His site is about as stripped down as you can get and still care whether people can find stuff or not. When he recently redesigned, I was amazed to see how spartan pages could be.

At the other end are structuralists like Khoi Vinh at Subtraction. His site is impressive for its disciplined use of an eight-column grid for design, and for the way every page is packed with tabular information about the site and about the information itself. Vinh goes overboard in some places. Do blog posts really need a table up top where the date is labeled "Posted", the author is labeled "Author" and the post itself is labeled "Body"? I think people will understand what the information signifies without the labels.

In the middle are sites like goodonpaper.org and rc3.org, both of which continue the mostly black type on white background theme.

I've tried to find my own middle ground, with generous navigation and metadata, but identified implicitly where possible, so as not to beat people over the head with it. For example, at first I had two separate lists of blog tags and blog archives on the home page. Then I realized that tags and years are both different ways of slicing blog posts, so I combined them into one list, but with a bullet between them. Then I removed even that distinction and didn't label it at all beyond "Blog". People will get it.

One way that my site differs from all four exemplars is that they all use the home page as their blog, or at least an extensive table of contents for their blog. I've always felt that my blog is just one part of the whole site, and so the home page shouldn't be just a proxy for the blog.

There's something very satisfying about being able to focus on one small thing, in this case, a single page, even just a single screenful, and polishing it. I looked at examples, thought about navigation, and considered graphics. I fiddled with the typography, adjusted the content, and played with the punctuation. I'm not a designer, so I'm sure someone else could have done it better, but it's my work, and I like the result a lot.

Watchdog.net and EveryBlock

Tuesday 15 April 2008

Aaron Swartz has a new venture, watchdog.net. It's an open endeavor to build a hub for politics on the Internet. It's part of a growing trend, data-driven sites to mine and use the structured data that is becoming more prevalent on the Internet every day. Another is Adrian Holovaty's EveryBlock ("a news feed for your block").

It'll be interesting to see how these play out. Making beautiful browsable interfaces to rich data is one thing; making them useful beyond surfing for trivia is another matter. It's hard to imagine two guys better suited to do it than Aaron and Adrian. I think great things could happen here.

Pixar and Disney line-up through 2012

Monday 14 April 2008

Here's the complete line-up of Disney and Pixar animated movies for the next four years. Things I noticed:

  • Pretty soon, everything's going to be 3-D.
  • Some movies are Disney and Pixar, and some are just Disney. Rapunzel is a Disney-only CG made without Pixar? Interesting.
  • There are plenty of sequels. Well, only two, but Pixar has six titles here, so a third of them are sequels. And there's the re-release of the Toy Stories in 3-D, which makes it feel like eight movies, four of which are sequels or re-hashes.

I still haven't seen Ratatouille, and my kids have, so I've got no one to see it with. I won't make that mistake with Wall-E...

Big dog robot and big dog beta

Thursday 10 April 2008

Bob is right: Boston Dynamics' Big Dog robot is very, very cool (watch the guy try to kick it over), and Big Dog beta is very, very funny.

This video must be going around. The other night, I was at Dan Bricklin's Tech Tuesday. There was a screen running there with videos of all sorts of technology, which people were not paying direct attention to, but it made for a good background attraction in a go-go dancer sort of way. One of the things I saw out of the corner of my eye was the Big Dog video.

The structure of .pyc files

Wednesday 9 April 2008

I spent some time digging around in the Python code to understand how .pyc files work. It turns out they are fairly simple, then kind of complex.

At the simple level, a .pyc file is a binary file containing only three things:

  • A four-byte magic number,
  • A four-byte modification timestamp, and
  • A marshalled code object.

The magic number is nothing as cool as cafebabe. It's simply two bytes that change with each change to the marshalling code, followed by two bytes of 0d0a. The 0d0a bytes are a carriage return and line feed, so if a .pyc file is processed as text, those bytes will change and the magic number will be corrupted. This keeps the file from executing after a copy corruption. The marshalling code is tweaked in every major release of Python, so in practice the magic number is unique to each version of the Python interpreter. For Python 2.5, it's b3f20d0a. (The gory details are in import.c.)
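The carriage return and line feed trick can be checked from the interpreter itself. This sketch assumes a current Python 3, where the magic number is exposed as importlib.util.MAGIC_NUMBER (in the Python 2 series described here, imp.get_magic() played that role). The header that follows the magic number has grown since this was written (Python 3.7 and later use a 16-byte header), but the magic number still comes first.

```python
import importlib.util

# The four magic bytes for whichever interpreter is running this.
magic = importlib.util.MAGIC_NUMBER
print(magic.hex())

# The last two bytes are the carriage return / line feed pair.
print(magic[2:] == b"\r\n")

# Simulate a text-mode copy that rewrites CR LF as a bare LF:
# the magic number no longer matches, so the file would be rejected.
mangled = magic.replace(b"\r\n", b"\n")
print(mangled != magic)
```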

The four-byte modification timestamp is the Unix modification timestamp of the source file that generated the .pyc, so that it can be recompiled if the source changes.

The entire rest of the file is just the output of marshal.dump of the code object that results from compiling the source file. Marshal is like pickle, in that it serializes Python objects. It has different goals than pickle, though. Where pickle is meant to produce version-independent serialization suitable for persistence, marshal is meant for short-lived serialized objects, so its representation can change with each Python version. Also, pickle is designed to work properly for user-defined types, while marshal handles the complexities of Python internal types. The one we care about in particular here is the code object.
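The marshal half of that comparison is easy to try: a code object can be dumped to bytes, loaded back, and executed. This is a small sketch written for a current Python 3; the source string and the names in it are made up for the example.

```python
import marshal

# A tiny made-up module body to compile.
source = "a, b = 1, 0\nresult = a or b\n"
code = compile(source, "<sample>", "exec")

# marshal serializes the code object to bytes...
data = marshal.dumps(code)

# ...and loads it back as an equivalent code object.
restored = marshal.loads(data)

# Run the restored code in a fresh namespace to show it still works.
namespace = {}
exec(restored, namespace)
print(namespace["result"])  # 1
```

The bytes in data are only meaningful to the Python version that produced them, which is the trade-off described above.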

The nature of marshalling gives us the important characteristics of .pyc files: they are independent of platform, but very sensitive to Python versions. A 2.4 .pyc file will not execute under 2.5, but it can be copied from one operating system to another just fine.

So that's the simple part: two longs and a marshalled code object. The complexity, of course, is in the structure of the code object. Code objects contain all sorts of information produced by the compiler, the meatiest of which is the bytecode itself.

Luckily it isn't hard to write a program to dump these things out, thanks to the marshal and dis modules:

import dis, marshal, struct, sys, time, types

def show_file(fname):
    f = open(fname, "rb")
    magic = f.read(4)
    moddate = f.read(4)
    # '<L' reads exactly four little-endian bytes; a plain 'L' is eight
    # bytes on some 64-bit platforms.
    modtime = time.asctime(time.localtime(struct.unpack('<L', moddate)[0]))
    print "magic %s" % (magic.encode('hex'))
    print "moddate %s (%s)" % (moddate.encode('hex'), modtime)
    code = marshal.load(f)
    show_code(code)

def show_code(code, indent=''):
    print "%scode" % indent
    indent += '   '
    print "%sargcount %d" % (indent, code.co_argcount)
    print "%snlocals %d" % (indent, code.co_nlocals)
    print "%sstacksize %d" % (indent, code.co_stacksize)
    print "%sflags %04x" % (indent, code.co_flags)
    show_hex("code", code.co_code, indent=indent)
    dis.disassemble(code)
    print "%sconsts" % indent
    for const in code.co_consts:
        if type(const) == types.CodeType:
            # Nested code objects (functions, classes) are dumped recursively.
            show_code(const, indent+'   ')
        else:
            print "   %s%r" % (indent, const)
    print "%snames %r" % (indent, code.co_names)
    print "%svarnames %r" % (indent, code.co_varnames)
    print "%sfreevars %r" % (indent, code.co_freevars)
    print "%scellvars %r" % (indent, code.co_cellvars)
    print "%sfilename %r" % (indent, code.co_filename)
    print "%sname %r" % (indent, code.co_name)
    print "%sfirstlineno %d" % (indent, code.co_firstlineno)
    show_hex("lnotab", code.co_lnotab, indent=indent)

def show_hex(label, h, indent):
    h = h.encode('hex')
    if len(h) < 60:
        print "%s%s %s" % (indent, label, h)
    else:
        # Long values are wrapped at 60 hex digits per line.
        print "%s%s" % (indent, label)
        for i in range(0, len(h), 60):
            print "%s   %s" % (indent, h[i:i+60])

show_file(sys.argv[1])


Running this on the .pyc from an ultra-simple Python file:

a, b = 1, 0
if a or b:
    print "Hello", a

produces this:

magic b3f20d0a
moddate 8a9efc47 (Wed Apr 09 06:46:34 2008)
   argcount 0
   nlocals 0
   stacksize 2
   flags 0040
  1           0 LOAD_CONST               4 ((1, 0))
              3 UNPACK_SEQUENCE          2
              6 STORE_NAME               0 (a)
              9 STORE_NAME               1 (b)

  2          12 LOAD_NAME                0 (a)
             15 JUMP_IF_TRUE             7 (to 25)
             18 POP_TOP
             19 LOAD_NAME                1 (b)
             22 JUMP_IF_FALSE           13 (to 38)
        >>   25 POP_TOP

  3          26 LOAD_CONST               2 ('Hello')
             29 PRINT_ITEM
             30 LOAD_NAME                0 (a)
             33 PRINT_ITEM
             34 PRINT_NEWLINE
             35 JUMP_FORWARD             1 (to 39)
        >>   38 POP_TOP
        >>   39 LOAD_CONST               3 (None)
             42 RETURN_VALUE
      (1, 0)
   names ('a', 'b')
   varnames ()
   freevars ()
   cellvars ()
   filename 'C:\\ned\\sample.py'
   name '<module>'
   firstlineno 1
   lnotab 0c010e01

A lot of this stuff I don't understand, but the byte codes are nicely disassembled and presented symbolically. The Python virtual machine is a stack-oriented interpreter, so a lot of the operations are loads and pops, and of course jumps and conditionals. For the adventurous: the byte-code interpreter is in ceval.c. The exact details of the byte codes change with each major version of Python. For example, the PRINT_ITEM and PRINT_NEWLINE opcodes we see here are gone in Python 3.0.

In the disassembled output, the left-most numbers (1, 2, 3) are the line numbers in the original source file and the next numbers (0, 3, 6, 9, ...) are the byte offsets of the instruction. The operands to the instruction are presented numerically, and then in parentheses, interpreted symbolically. Lines with ">>" are the targets of jump instructions somewhere else in the code.
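The same three columns can be pulled out programmatically. This sketch assumes a current Python 3, where dis.get_instructions yields one record per instruction; the opcodes it prints will differ from the 2.5 listing above, since the byte codes change between versions. The sample source is the same tiny program, with print written as a function call.

```python
import dis

# The sample program, adapted to Python 3 syntax.
source = 'a, b = 1, 0\nif a or b:\n    print("Hello", a)\n'
code = compile(source, "sample.py", "exec")

# One record per instruction: byte offset, symbolic opcode name, and the
# operand rendered symbolically (the part shown in parentheses by dis).
instructions = list(dis.get_instructions(code))
for instr in instructions:
    print("%4d %-24s %s" % (instr.offset, instr.opname, instr.argrepr))

# dis.dis prints the familiar layout, including source line numbers
# and the ">>" markers on jump targets.
dis.dis(code)
```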

This sample was very simple, with a single code object for the flow of instructions in the module. A real module with class and function definitions would be more complicated. The classes and functions would themselves be code objects in the consts list, nested as deeply as needed to represent the module. The module code object has class code objects which themselves have function code objects, and so on.
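The nesting can be made visible by walking co_consts recursively. This is a minimal sketch for a current Python 3; the walk_code helper and the sample module (a Greeter class and a helper function) are invented for the illustration.

```python
import types

def walk_code(code, depth=0):
    """Yield (depth, name) for a code object and every code object nested in it."""
    yield depth, code.co_name
    for const in code.co_consts:
        if isinstance(const, types.CodeType):
            for item in walk_code(const, depth + 1):
                yield item

# A made-up module with a class, a method, and a function.
source = '''
class Greeter:
    def greet(self):
        return "hi"

def helper():
    pass
'''

module_code = compile(source, "sample.py", "exec")
tree = list(walk_code(module_code))
for depth, name in tree:
    print("   " * depth + name)
```

The module code object sits at depth 0, the class body and the function at depth 1, and the method inside the class at depth 2.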

Once you start digging around at this level, there are all sorts of facilities for working with code objects. In the standard library, there's the compile built-in function, and the compiler, codeop and opcode modules. For the truly adventurous, there are third-party packages like codewalk, byteplay and bytecodehacks. PEP 339 gives more detail about compilation and opcodes. Finally, Ananth Shrinivas had another take on exploring Python bytecode.


Monday 7 April 2008

FontStruct is a pixel-font editor on steroids. On a simple grid, you construct a font. But because the shapes available at each grid point are not simple on/off squares, the range of possibilities is much greater. The gallery shows a number of surprisingly expressive faces.

Zelda treasure cake

Monday 7 April 2008

For Ben's tenth, a Legend of Zelda cake. It's a treasure chest full of symbols from the game. Max was chief Zelda consultant, guiding us on which symbols were important, which tangential. Included are a sword, shield (of some particular design), a compass, the triforce, and rupees to fill in the gaps.

The lid of the box (with the writing on it) is a slab of brownies!

As usual, Susan has the full story.

Tabblo: Legend of Zelda Treasure Chest Cake

Aptus 1.55 and wx buffered drawing

Saturday 5 April 2008

After a pair of releases of Aptus (my Mandelbrot viewer) last weekend, Rob McMullen wrote to me with a patch to eliminate flickering while drawing on Windows. I was thrilled. I've been hacking on side projects with wxPython for a few years now, and the intricacy of paint events, and the vagueness of the double-buffering docs, have conspired to leave me feeling uncertain. When the code works, I sometimes don't understand why.

So I was psyched for Rob's patch. It worked great, the Windows flicker was gone. But on the Mac, nothing was drawing at all! I tried a few simple tweaks, but they didn't help. The previous code had worked without flickering on the Mac, so in the end I used the old code on the Mac and the new code on Windows and Linux. It's not a nice solution, but again, I don't know why it's behaving the way it is, so it's hard for me to track down.

Aptus 1.55 is now posted, and it doesn't flicker on any of the three platforms, but the code still has the ad-hoc platform check in it. If anyone out there knows how this stuff works, and wants to help educate a poor sinner, take a look at the code and drop me a line. I'll be forever grateful.

Design pranks

Thursday 3 April 2008

I know I'm a few days late, but I liked these two design-related jokes:

  • A Ford re-branding which was well balanced on the lame/professional line, enough so to believe that (horrifyingly) maybe it was true. In this case, it was a student's class project. Good PR for him.
  • The Serif announced Helvetica Serif, based on drawings by Max Miedinger. The joke was poorly executed, though, since the sample shown is nothing more than Times Bold. In fact, Google coughs up an actual attempt at Helvetica Serif, which is at least different enough to have sparked some outraged reactions.

Some people think April Fools' jokes suck (though is that post itself a joke? hard to say). But I like them because the best ones work by exploring the edge of plausibility. They are witty "what if?" questions. Could Ford really have tossed out their century-old brand? Other companies have made moves that bold/stupid. When a joke like this works, it's because it's in the twilight zone between obvious and impossible.
