
Stack Exchange 2.0

Wednesday 28 April 2010

In the beginning there was Stack Overflow, the programmer's Q&A site. It's been very successful, easily overtaking its competition. It's now the clear best choice for a place to ask questions and look for answers about programming.

After a few months, Stack Overflow spawned Server Fault (for system administration topics) and Super User (for computer user topics). They've been moderately successful, though nothing like Stack Overflow: they've each accumulated 36,000 questions, while Stack Overflow has 640,000 so far.

Then they figured, why can't we handle any topics at all, and let anyone create their own site? And so Stack Exchange was born, a site where anyone could create a Q&A arena on whatever topic they wanted.

But it wasn't free. In fact, it seemed kind of expensive: $129/month. And the sites weren't taking off. It seemed that each step removed from programming questions meant a 20x drop-off in traffic. The Stack Overflow team (Joel Spolsky, Jeff Atwood and a bunch of others) are now looking for ways to extend their success, including getting some investment.

As part of their new plan, they've announced changes to Stack Exchange: Stack Exchange 2.0. Everything's free now, but the process for creating new sites has become as convoluted as a Politburo meeting. Interestingly, the comments on the announcement are mostly mad about the loss of the paid option, because a paid site is owned by its creator, while the free sites are not.

I think the new community creation process is way too heavyweight, especially where they require a certain number of users with a certain number of karma points on existing sites to commit to a community before it will be created.

Overall, it's a familiar internet story: a startup creates something, people start using it, but then the business plan shifts, and users are left feeling abandoned. Small startups have to adapt to survive, but they don't want to piss off too many people along the way.

One interesting point in this whole thing: Joel has been very direct with people, telling them if they think they can do a better job building a community, they're welcome to use one of the Stack Overflow clones to do it. And there are a bunch of clones: Array Shift is built in Drupal, OSQA is a Django app, and Shapado looks pretty full-featured. There are probably more.

It'll be interesting to watch the continued evolution of the Stack Overflow ecosystem. I'm not sure any community will get the critical mass that the original Stack Overflow did, but it's worth trying a few ways to make it happen.

Converting Blogger to Wordpress

Saturday 24 April 2010

Until last weekend, Susan's blog had been done with Blogger. We made use of the FTP feature to push all the content to static HTML files on her server. But Blogger is discontinuing FTP support, so we had to do something.

I'm a huge believer in keeping old URLs working, so I didn't want to switch to a blogspot.com blog, or even move to blog.susansenator.com. Besides, Blogger had been seeming pretty creaky for a while, so I took the opportunity to try something better, namely Wordpress.

Creating the Wordpress blog was pretty simple. Our hosting provider offers one-click installation which worked great. Making a Wordpress theme can be a big undertaking, but not if you're just trying to mimic an existing simple blog layout. I downloaded a simple theme and started hacking away on it. The Wordpress docs are pretty good, definitely better than Blogger's. That's a recurring theme here.

Migrating all the content over was a bigger deal. Blogger offers a backup facility that gives you your entire blog as a giant XML file. Converting that to a Wordpress format was simple with a set of blog converters that includes blogger2wordpress, which turned my 16MB Blogger XML file into a 12MB Wordpress XML file.

Wordpress can then import the XML file, but only up to a maximum size of 2MB (why?). So I manually split the big XML file into 8 smaller XML files, which was tedious but not difficult. Importing each of them brought in all the old blog posts and comments. Nice. (For some reason, embedded YouTube videos are now just a URL in text, not sure why. If I had noticed that earlier I might have been able to do something about it.)
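The manual split could probably be scripted. Here is a minimal sketch in Python 3, assuming the export is an RSS-style file with item elements under a single channel element. The name split_export and the items-per-file count are made up for illustration, and it splits by item count rather than by byte size, so the count would need tuning to keep each file under the import limit.

```python
import copy
import xml.etree.ElementTree as ET

def split_export(xml_text, items_per_file):
    """Return a list of XML strings, each with at most items_per_file items.

    Assumes an <rss><channel>...<item/>...</channel></rss> layout.
    Note: ElementTree may rename namespace prefixes on output unless they
    are registered first with ET.register_namespace.
    """
    root = ET.fromstring(xml_text)
    channel = root.find("channel")
    items = channel.findall("item")

    # Strip the items out, leaving only the shared channel header.
    for item in items:
        channel.remove(item)

    chunks = []
    for start in range(0, len(items), items_per_file):
        # Each chunk is a copy of the header plus its own slice of items.
        piece = copy.deepcopy(root)
        piece.find("channel").extend(items[start:start + items_per_file])
        chunks.append(ET.tostring(piece, encoding="unicode"))
    return chunks
```

Each returned string could then be written to its own file and imported separately.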

Now we have a Wordpress blog that works just like the Blogger blog did, except that everything has a different permalink than it did before. The first step to fix that is to change the permalink style Wordpress uses. It defaults to something horrendous like:

http://susansenator.com/blog/?p=123

Select "Month and name" under Permalink settings in the Wordpress installation. This makes Wordpress use nice URLs like:

http://susansenator.com/blog/2010/04/here-be-dragons/

Changing this setting will either add or require you to add a chunk of mod_rewrite rules to your Apache .htaccess file:

<IfModule mod_rewrite.c>
RewriteEngine On
RewriteBase /blog/
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule . /blog/index.php [L]
</IfModule>

But lots of other things are subtly different. Archive pages are named differently, Blogger had an index.html page for the blog, and so on. I manually added these rewrites to fix these issues:

# Blogger slugs have .html, wordpress does not.
RewriteRule ^blog/([0-9]{4})/([0-9]{2})/(.*)\.html?$ /blog/$1/$2/$3/ [R=301,L]

# Blogger archives are different.
RewriteRule ^blog/([0-9]{4})_([0-9]{2})_01_archive\.html /blog/$1/$2/ [R=301,L]

# Blogger feeds are now found at the wordpress feed
RewriteRule ^blog/atom.xml /blog/feed/ [R=301,L]
RewriteRule ^blog/rss.xml /blog/feed/ [R=301,L]

# Blogger had the old-style index.html.
RewriteRule ^blog/index.html /blog/ [R=301,L]
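
One way to sanity-check rules like these before deploying them is to replay the patterns against sample paths. This is a rough Python 3 sketch, not how Apache itself evaluates them: it assumes Python's re module treats these simple patterns the same way mod_rewrite's regex engine does, and the RULES table and redirect_for helper are hypothetical names.

```python
import re

# (pattern, target) pairs copied from the RewriteRule lines, with Apache's
# $1-style backreferences rewritten in Python's \1 style.
RULES = [
    (r"^blog/([0-9]{4})/([0-9]{2})/(.*)\.html?$", r"/blog/\1/\2/\3/"),
    (r"^blog/([0-9]{4})_([0-9]{2})_01_archive\.html", r"/blog/\1/\2/"),
    (r"^blog/atom.xml", "/blog/feed/"),
    (r"^blog/rss.xml", "/blog/feed/"),
    (r"^blog/index.html", "/blog/"),
]

def redirect_for(path):
    """Return the redirect target for an old path, or None if no rule matches."""
    for pattern, target in RULES:
        match = re.search(pattern, path)
        if match:
            # Like the [L] flag: the first matching rule wins.
            return match.expand(target)
    return None
```

Feeding it an old post path such as blog/2010/04/here-be-dragons.html should give back the new Wordpress-style permalink, while a path that is already in the new style should match nothing.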

The thorniest problem, though, is that Blogger and Wordpress don't agree on how to turn a post title into a slug. Both lowercase the text and change spaces to dashes, but Wordpress includes every word, while Blogger leaves out "a" and "the", and maybe others.

The simplest way to solve the differing slug problem was to examine the wordpress.xml file. It had the title of the posts, and the Blogger slug, in the form of the post's permalink. I could determine which posts would have a new slug under Wordpress, and create a redirect for them.

A quick Python program did the work:

from lxml import etree
import re, sys

def items(f):
    doc = etree.parse(open(f))    
    items = doc.xpath('.//item')
    for item in items:
        title = item.xpath('title/text()')
        link = item.xpath('link/text()')
        if title and link:
            yield (title[0], link[0])

# Regexes for turning a title into a Wordpress slug
slugify = [
    # Drop everything but nice word characters
    (r"[^-a-z0-9 ]", ""),
    # All spaces become dashes
    (r" ", "-"),
    # Multiple dashes become one
    (r"-+", "-"),
    ]

def do_file(f):
    for title, link in items(f):
        if "susansenator.com" not in link:
            continue
        slug = link.split('/')[-1].split('.')[0]
        wpslug = title.lower()
        for pat, rep in slugify:
            wpslug = re.sub(pat, rep, wpslug)
        if wpslug != slug:
            old_path = link.replace("http://susansenator.com/", "")
            new_path = old_path.rsplit('/', 1)[0] + "/" + wpslug
            
            print "RewriteRule ^%s /%s [R=301,L]" % (
                old_path.replace(".", r"\."),
                new_path
            )
        
do_file(sys.argv[1])

This just looks at every post, extracts the Blogger slug from the post's link, and computes the Wordpress slug. Where the two slugs differ, a rewrite rule is written. On Susan's blog, this produced 446 rewrite rules, which went into .htaccess:

### These are posts that slugify differently under blogger and wordpress, to keep old permalinks working:
RewriteRule ^blog/2010/04/cheerful-feelings-upon-awakening-in\.html /blog/2010/04/cheerful-feelings-upon-awakening-in-the-country [R=301,L]
RewriteRule ^blog/2010/03/here-is-my-passover-album-on-facebook-i\.html /blog/2010/03/passover-pics [R=301,L]
RewriteRule ^blog/2010/03/reality-of-autism-rifts-and-what-obama\.html /blog/2010/03/the-reality-of-the-autism-rifts-and-what-obama-should-do [R=301,L]
# ... 440 skipped ...
RewriteRule ^blog/2005/10/autism-and-school-board\.html /blog/2005/10/autism-and-the-school-board [R=301,L]
RewriteRule ^blog/2005/10/speed-of-dark\.html /blog/2005/10/the-speed-of-dark [R=301,L]
RewriteRule ^blog/2005/10/adolescence-without-roadmap\.html /blog/2005/10/adolescence-without-a-roadmap [R=301,L]
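
To see how a single rule comes out, the core of the script can be restated as a pair of small functions. This is a sketch in Python 3 rather than the original script: the names wordpress_slug and rule_for are made up for illustration, and the post title in the usage note below is a guess.

```python
import re

# The same substitutions as the slugify list in the script above.
SLUGIFY = [
    (r"[^-a-z0-9 ]", ""),   # drop everything but nice word characters
    (r" ", "-"),            # all spaces become dashes
    (r"-+", "-"),           # multiple dashes become one
]

def wordpress_slug(title):
    """Approximate the slug Wordpress would make from a post title."""
    slug = title.lower()
    for pat, rep in SLUGIFY:
        slug = re.sub(pat, rep, slug)
    return slug

def rule_for(title, link, site="http://susansenator.com/"):
    """Return a RewriteRule line if the two slugs differ, else None."""
    blogger_slug = link.split("/")[-1].split(".")[0]
    wp_slug = wordpress_slug(title)
    if wp_slug == blogger_slug:
        return None
    old_path = link.replace(site, "")
    new_path = old_path.rsplit("/", 1)[0] + "/" + wp_slug
    return "RewriteRule ^%s /%s [R=301,L]" % (
        old_path.replace(".", r"\."),
        new_path,
    )
```

Calling rule_for("The Speed of Dark", "http://susansenator.com/blog/2005/10/speed-of-dark.html") should produce the same line as the speed-of-dark rule listed above, assuming that was the post's title.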

With the new super-sized .htaccess in place, the new blog is ready to go. All existing links work well, and no one misses a beat.

Organic metaclasses

Sunday 18 April 2010

The way I learn things, I can read about something a number of times, and intellectually understand it, but it won't really sink in until I have a real reason to try it out myself. Toy examples don't do it for me: I have to have an actual problem in hand before the solution becomes part of my repertoire. Recently I finally had a use for metaclasses.

I wanted to create an in-memory list of items that I could reference by key. It was a micro-database of languages:

class Language(object):

    # The class attribute of all languages, mapped by id.    
    _db = {}
    
    def __init__(self, **kwargs):
        for k, v in kwargs.iteritems():
            setattr(self, k, v)
        self._db[self.id] = self
        
    @classmethod
    def get(cls, key):
        return cls._db.get(key)

Language(
    id = 'en',
    name = _('English'),
    native = u'English',
    )

Language(
    id = 'fr',
    name = _('French'),
    native = u'Fran\u00E7ais',
    )

Language(
    id = 'nl',
    name = _('Dutch'),
    native = u'Nederlands',
    )

# Some time later:
lang = Language.get(langcode)
lang.native # blah blah

This worked well: it gave me a simple schema-less set of constant items that I could look up by id. And the class attribute _db is used implicitly in the constructor, so I get a clean declarative syntax for building my list of languages.

But then I wanted another set, for countries, so I made a MiniDbItem class to derive both Language and Country from:

class MiniDbItem(object):
    def __init__(self, **kwargs):
        for k, v in kwargs.iteritems():
            setattr(self, k, v)
        self._db[self.id] = self
        
    @classmethod
    def get(cls, key):
        return cls._db.get(key)

class Language(MiniDbItem):
    _db = {}

Language(id='en', ...)
Language(id='fr', ...)

class Country(MiniDbItem):
    _db = {}
    
Country(id='US', ...)
Country(id='FR', ...)

This works, but the unfortunate part is that each derived class has to define its own _db class attribute to keep the Languages separate from the Countries. Each derived class is obligated to do that little bit of redundant work, or the MiniDbItem base class isn't used properly.

The way to avoid that is to use a metaclass. The metaclass provides an __init__ method. In a class, __init__ is called when new class instances are created, but in a metaclass, __init__ is called when new classes are created.

class MetaMiniDbItem(type):
    """ A metaclass to give every class derived from MiniDbItem
        a _db attribute.
    """
    def __init__(cls, name, bases, dict):
        super(MetaMiniDbItem, cls).__init__(name, bases, dict)
        # Each class has its own _db, a dict of its items
        cls._db = {}

class MiniDbItem(object):
    
    __metaclass__ = MetaMiniDbItem

    def __init__(self, **kwargs):
        for k, v in kwargs.iteritems():
            setattr(self, k, v)
        self._db[self.id] = self
        
    @classmethod
    def get(cls, key):
        return cls._db.get(key)

class Language(MiniDbItem): pass

Language(id='en', ...)
Language(id='fr', ...)

class Country(MiniDbItem): pass
    
Country(id='US', ...)
Country(id='FR', ...)

Now MetaMiniDbItem.__init__ is invoked twice: once when class Language is defined, and again when class Country is defined. The class being constructed is passed in as the cls parameter. We use super to invoke the regular class creation machinery, then simply set the _db attribute on the class like we want.
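
For anyone on Python 3, where the __metaclass__ attribute is no longer honored, the same idea can be sketched with the metaclass keyword in the class statement. This is a minimal restatement under that assumption, with made-up sample data, not a drop-in replacement for the code above:

```python
class MetaMiniDbItem(type):
    """A metaclass to give every class derived from MiniDbItem its own _db."""
    def __init__(cls, name, bases, namespace):
        super().__init__(name, bases, namespace)
        # Runs once per class definition, so each class gets a fresh dict.
        cls._db = {}

class MiniDbItem(metaclass=MetaMiniDbItem):
    def __init__(self, **kwargs):
        for k, v in kwargs.items():
            setattr(self, k, v)
        self._db[self.id] = self

    @classmethod
    def get(cls, key):
        return cls._db.get(key)

class Language(MiniDbItem):
    pass

class Country(MiniDbItem):
    pass

# Sample data for illustration only.
Language(id="fr", native="Fran\u00e7ais")
Country(id="FR", name="France")
```

Here Language.get("fr") finds the language, while Language.get("FR") comes back as None, because the two classes no longer share a dictionary.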

Of course, metaclasses can be used to do many more things than simply setting a class attribute, but this example was the first time in my work that metaclasses seemed like a natural solution to a problem rather than an advanced-magic Stupid Python Trick.

Web development peeve

Wednesday 14 April 2010

OK, in the scheme of things, this is really minor, but it irks me. Wouldn't it have been great if the query component of a URL started with an ampersand instead of a question mark?

How many times do we have to write something like this:

// Add foo onto the URL params
params += params ? "&" : "?";
params += "foo=1723";

If the query component started with ampersand, we could just tack on "&foo=1723" and be done with it. From a whole-URL view, there's some sense to separating the query and the path with a distinct character like question mark, but it's not like it would have been unparseable to say the query component starts with the first ampersand.

Next time we'll get it right... :)

And while we're on the subject, why has the Python library got the tools to deal with URLs as structured data spread across three different modules? Turns out it doesn't take much to pull them all together into a Url class that can help with URL construction and parsing tasks:

import cgi, urllib, urlparse

class Url(object):
    """A structured URL.
    
    Create from a string or Django request, then read or write the components
    through attributes `scheme`, `netloc`, `path`, `params`, `query`, and
    `fragment`.
    
    The query is more usefully available as the dictionary `args`.
    
    """
    def __init__(self, url):
        """Construct from a string or Django request."""
        if hasattr(url, 'get_full_path'):
            url = url.get_full_path()
        
        self.scheme, self.netloc, self.path, self.params, \
            self.query, self.fragment = urlparse.urlparse(url)
        self.args = dict(cgi.parse_qsl(self.query))

    def __str__(self):
        """Turn back into a URL."""
        self.query = urllib.urlencode(self.args)
        return urlparse.urlunparse((
            self.scheme, self.netloc, self.path, self.params,
            self.query, self.fragment
            ))

Now I can do stuff like:

# Redirect to one of our canonical hosts, with an extra arg.
url = Url(request)
url.netloc = THE_SECURE_HOST if request.is_secure() else THE_HOST
url.args['from'] = request.get_host()
return http.HttpResponseRedirect(str(url))

This takes care of all the URL syntax logic for me, so I don't have to think about question marks and ampersands ever again.
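
In Python 3 the pieces used here live together in urllib.parse, so the same class can be sketched with a single import. This is a restatement under that assumption, not the code I actually run; the example.com URLs in the usage note are placeholders.

```python
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse

class Url:
    """A structured URL, with the query available as the dict `args`."""

    def __init__(self, url):
        """Construct from a string or any object with get_full_path()."""
        if hasattr(url, "get_full_path"):
            url = url.get_full_path()
        (self.scheme, self.netloc, self.path, self.params,
         self.query, self.fragment) = urlparse(url)
        self.args = dict(parse_qsl(self.query))

    def __str__(self):
        """Turn back into a URL, rebuilding the query from `args`."""
        self.query = urlencode(self.args)
        return urlunparse((
            self.scheme, self.netloc, self.path, self.params,
            self.query, self.fragment,
        ))
```

Adding args["foo"] = "1723" to Url("http://example.com/path") and to Url("http://example.com/path?a=1") picks the right separator in both cases, which is the whole point.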

An Apache break in

Tuesday 13 April 2010

Apache.org had an incident last week which started as a cross-site scripting attack and ended with the attackers gaining root access to their servers. The full story is worth a read because it's instructional to see how the mistakes compound and the attackers used each new foothold to gain access to another deeper level in the system. It reads like a laundry list of simple security mistakes, but strung together in a real world scenario that resulted in a serious breach of security.

And it ends with a great honest example of the open source philosophy:

We hope our disclosure has been as open as possible and true to the ASF spirit. Hopefully others can learn from our mistakes.
