Web development peeve

Wednesday 14 April 2010

This is 13 years old. Be careful.

OK, in the scheme of things, this is really minor, but it irks me. Wouldn’t it have been great if the query component of a URL started with an ampersand instead of a question mark?

How many times do we have to write something like this:

// Add foo onto the URL params
params += params ? "&" : "?";
params += "foo=1723";

If the query component started with ampersand, we could just tack on “&foo=1723” and be done with it. From a whole-URL view, there’s some sense to separating the query and the path with a distinct character like question mark, but it’s not like it would have been unparseable to say the query component starts with the first ampersand.
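
For comparison, here is a minimal Python 3 sketch of that same dance. The helper name `add_param` is made up for illustration, and it assumes the URL has no fragment and that any "?" already present marks the start of the query:

```python
from urllib.parse import urlencode

def add_param(url, name, value):
    """Append name=value to url, choosing '?' or '&' as needed.

    Hypothetical helper: assumes no fragment, and that any '?' in the
    URL already marks the start of the query component.
    """
    separator = "&" if "?" in url else "?"
    return url + separator + urlencode({name: value})

print(add_param("http://example.com/page", "foo", 1723))
print(add_param("http://example.com/page?a=1", "foo", 1723))
```

The conditional is exactly the part that would disappear if the query simply started with an ampersand.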

Next time we’ll get it right... :)

And while we’re on the subject, why has the Python library got the tools to deal with URLs as structured data spread across three different modules? Turns out it doesn’t take much to pull them all together into a Url class that can help with URL construction and parsing tasks:

import cgi, urllib, urlparse

class Url(object):
    """A structured URL.

    Create from a string or Django request, then read or write the components
    through attributes `scheme`, `netloc`, `path`, `params`, `query`, and
    `fragment`.

    The query is more usefully available as the dictionary `args`.

    """

    def __init__(self, url):
        """Construct from a string or Django request."""
        if hasattr(url, 'get_full_path'):
            url = url.get_full_path()
        self.scheme, self.netloc, self.path, self.params, \
            self.query, self.fragment = urlparse.urlparse(url)
        self.args = dict(cgi.parse_qsl(self.query))

    def __str__(self):
        """Turn back into a URL."""
        self.query = urllib.urlencode(self.args)
        return urlparse.urlunparse((
            self.scheme, self.netloc, self.path, self.params,
            self.query, self.fragment
        ))

Now I can do stuff like:

# Redirect to one of our canonical hosts, with an extra arg.
url = Url(request)
url.netloc = THE_SECURE_HOST if request.is_secure() else THE_HOST
url.args['from'] = request.get_host()
return http.HttpResponseRedirect(str(url))

This takes care of all the URL syntax logic for me, so I don’t have to think about question marks and ampersands ever again.
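
The class above is Python 2. As a rough sketch of the same idea on Python 3, where the parsing and encoding functions all live in `urllib.parse`, something like the following should work. It leaves out the Django request handling, uses `example.com` hosts as stand-ins, and passes `keep_blank_values=True` so that empty arguments survive a round trip; treat it as an illustration rather than a drop-in replacement:

```python
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse

class Url:
    """A structured URL with the query exposed as the dict `args`."""

    def __init__(self, url):
        # urlparse returns a 6-tuple, unpacked here into attributes.
        (self.scheme, self.netloc, self.path, self.params,
         self.query, self.fragment) = urlparse(url)
        # keep_blank_values=True preserves arguments such as "bar=".
        self.args = dict(parse_qsl(self.query, keep_blank_values=True))

    def __str__(self):
        """Turn back into a URL, rebuilding the query from `args`."""
        self.query = urlencode(self.args)
        return urlunparse((self.scheme, self.netloc, self.path,
                           self.params, self.query, self.fragment))

# Example hosts are placeholders, not from the original post.
url = Url("http://example.com/page?a=1")
url.netloc = "secure.example.com"
url.args["from"] = "example.com"
print(url)  # http://secure.example.com/page?a=1&from=example.com
```

Because `args` is a plain dict, a repeated query argument keeps only its last value, just as in the Python 2 version.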


I recently found out that the scheme/netloc/... breakdown isn't quite right. I wanted to decode an IRI, and support http://user@xn--bcher-kva.ch/ where the netloc includes the IDNA-encoded hostname. I think the user is supposed to be utf-8 encoded. I ended up writing my own parser, but honestly I don't know enough to know if what I did fits into the larger scheme of how to work with IRIs.
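
To make the netloc point concrete, here is a small Python 3 sketch that splits the user and host out of that URL and decodes the host with the standard library's `idna` codec. It is only an illustration: the built-in codec follows the older IDNA 2003 rules, and this says nothing about how the user part should be encoded.

```python
from urllib.parse import urlsplit

parts = urlsplit("http://user@xn--bcher-kva.ch/")

# netloc is "user@xn--bcher-kva.ch"; these attributes break it down further.
print(parts.username)  # user
print(parts.hostname)  # xn--bcher-kva.ch

# Decode the ASCII (IDNA) form of the hostname to Unicode, and back again.
unicode_host = parts.hostname.encode("ascii").decode("idna")
print(unicode_host)                 # bücher.ch
print(unicode_host.encode("idna"))  # b'xn--bcher-kva.ch'
```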
Well, first, the definition of the question mark is part of the URI spec, but the use of ampersands is part of the HTML spec. There are plenty of syntaxes for the query component of a URI that don't use ampersands at all; for example, the "server-side image map" style which provides a comma-separated pair of integer coordinates: "http://foo/bar?100,385". Heck, you could pass percent-encoded JSON in the query string if you wanted. You are certainly free to use URLs with a query syntax like "http://foo/bar?a=3?b=4?c=5"; only HTML forms are probably going to hold you back.

So you should really rename your Url class to something like DjangoHTMLFormsUrl, since it only handles that one query syntax. Unless you want to hide that information because you want to be stuck with HTML only and forever ;) There are other media types in town; I'll introduce you if you buy the drinks.
Isn't the new recommendation to use ; instead of & anyway? The choice of & as a separator was pretty silly given that it has to be escaped in HTML.
For that matter, quoting from http://www.w3.org/TR/1999/REC-html401-19991224/appendix/notes.html#h-B.2.2 :

We recommend that HTTP server implementors, and in particular, CGI implementors support the use of ";" in place of "&" to save authors the trouble of escaping "&" characters in this manner.
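
For what it's worth, recent Python 3 releases let you choose the separator when parsing, via the `separator` argument to `urllib.parse.parse_qsl`. A minimal sketch, assuming a Python version new enough to have that argument (`urlencode` has no matching option, so the output side is joined by hand here):

```python
from urllib.parse import parse_qsl, quote_plus

# Parse a query string that uses ";" between arguments.
pairs = parse_qsl("a=1;b=2", separator=";")
print(pairs)  # [('a', '1'), ('b', '2')]

# Build a ";"-separated query string manually from the pairs.
query = ";".join(
    quote_plus(name) + "=" + quote_plus(value) for name, value in pairs
)
print(query)  # a=1;b=2
```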

And I thought I was just being silly by calling a method that collects query parameters in a list or dict and outputs a query string: url = base_url + '?' + querystring. If there is no query string, no '?'. All major JavaScript libraries have one. Python has one too, in urllib.urlencode.
The return of urlparse() is already a structured object, but I wish it was mutable. That you have to wrap it just to provide mutability is very silly.
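
A partial workaround, sketched here in Python 3: the result is a named tuple, so `_replace` gives you a modified copy without wrapping anything. The host names are placeholders for illustration.

```python
from urllib.parse import urlsplit

parts = urlsplit("http://example.com/page?a=1")

# _replace returns a new result object; the original is left untouched.
moved = parts._replace(netloc="secure.example.com")
print(moved.geturl())  # http://secure.example.com/page?a=1
print(parts.netloc)    # example.com
```

It is still not mutation in place, and the query remains a raw string rather than a dictionary.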
Hi Ned, I recently found a small problem in this class. I think you want to pass keep_blank_values=True to cgi.parse_qsl (or urlparse.parse_qsl as we're using now). Otherwise http://foo.com/?bar= becomes http://foo.com/ when reassembled. It's also possible that you want parse_qs rather than parse_qsl but I haven't dug into that to know for sure.
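
A quick Python 3 illustration of both points, using `urllib.parse` (where these functions now live). It shows the blank value being dropped by default, and how `parse_qsl` wrapped in `dict` differs from `parse_qs` when an argument repeats:

```python
from urllib.parse import parse_qs, parse_qsl, urlencode

# Blank values are dropped unless keep_blank_values=True is passed.
print(parse_qsl("bar="))                          # []
print(parse_qsl("bar=", keep_blank_values=True))  # [('bar', '')]

# dict(parse_qsl(...)) keeps only the last value of a repeated argument.
single = dict(parse_qsl("a=1&a=2"))
print(single)  # {'a': '2'}

# parse_qs keeps every value in a list; doseq=True writes them all back out.
multi = parse_qs("a=1&a=2")
print(multi)                         # {'a': ['1', '2']}
print(urlencode(multi, doseq=True))  # a=1&a=2
```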
