Web development peeve

Wednesday 14 April 2010

OK, in the scheme of things, this is really minor, but it irks me. Wouldn't it have been great if the query component of a URL started with an ampersand instead of a question mark?

How many times do we have to write something like this:

// Add foo onto the URL params
params += params ? "&" : "?";
params += "foo=1723";

If the query component started with ampersand, we could just tack on "&foo=1723" and be done with it. From a whole-URL view, there's some sense to separating the query and the path with a distinct character like question mark, but it's not like it would have been unparseable to say the query component starts with the first ampersand.
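The same dance shows up in Python. Here is a minimal sketch, assuming Python 3: `add_param` is a hypothetical helper name, it checks the whole URL for a question mark rather than a separate `params` string, and it assumes the pair is already encoded and that the URL has no fragment.

```python
def add_param(url, param):
    """Append an already-encoded "name=value" pair to url.

    Hypothetical helper for illustration only; it ignores fragments.
    """
    # The separator depends on whether a query component already exists.
    separator = "&" if "?" in url else "?"
    return url + separator + param


print(add_param("http://example.com/page", "foo=1723"))
print(add_param("http://example.com/page?bar=2", "foo=1723"))
```

The conditional is the whole annoyance: every caller has to know whether it is adding the first parameter or a later one.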

Next time we'll get it right... :)

And while we're on the subject, why has the Python library got the tools to deal with URLs as structured data spread across three different modules? Turns out it doesn't take much to pull them all together into a Url class that can help with URL construction and parsing tasks:

import cgi, urllib, urlparse

class Url(object):
    """A structured URL.

    Create from a string or Django request, then read or write the components
    through attributes `scheme`, `netloc`, `path`, `params`, `query`, and
    `fragment`. The query is more usefully available as the dictionary `args`.
    """

    def __init__(self, url):
        """Construct from a string or Django request."""
        if hasattr(url, 'get_full_path'):
            url = url.get_full_path()
        self.scheme, self.netloc, self.path, self.params, \
            self.query, self.fragment = urlparse.urlparse(url)
        self.args = dict(cgi.parse_qsl(self.query))

    def __str__(self):
        """Turn back into a URL."""
        self.query = urllib.urlencode(self.args)
        return urlparse.urlunparse((
            self.scheme, self.netloc, self.path, self.params,
            self.query, self.fragment
        ))

Now I can do stuff like:

# Redirect to one of our canonical hosts, with an extra arg.
url = Url(request)
url.netloc = THE_SECURE_HOST if request.is_secure() else THE_HOST
url.args['from'] = request.get_host()
return http.HttpResponseRedirect(str(url))

This takes care of all the Url syntax logic for me, so I don't have to think about question marks and ampersands ever again.
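The class above is Python 2 code. In Python 3 the parsing and encoding functions it uses all live in `urllib.parse`, so a port might look like the sketch below. It is an illustration under assumptions, not the original class: it accepts plain strings only (the Django request handling is left out), and the example hosts are made up.

```python
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse


class Url:
    """A structured URL (Python 3 sketch, strings only)."""

    def __init__(self, url):
        (self.scheme, self.netloc, self.path, self.params,
         self.query, self.fragment) = urlparse(url)
        # As in the original, a repeated key keeps only its last value.
        self.args = dict(parse_qsl(self.query))

    def __str__(self):
        # Rebuild the query from args so edits to the dict show up.
        self.query = urlencode(self.args)
        return urlunparse((
            self.scheme, self.netloc, self.path, self.params,
            self.query, self.fragment,
        ))


url = Url("http://example.com/path?a=1")
url.netloc = "secure.example.com"
url.args["from"] = "example.com"
print(url)  # http://secure.example.com/path?a=1&from=example.com
```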


Andrew Dalke 8:17 AM on 14 Apr 2010

I recently found out that the scheme/netloc/... breakdown isn't quite right. I wanted to decode an IRI, and support http://user@xn--bcher-kva.ch/ where the netloc includes the IDNA-encoded hostname. I think the user is supposed to be UTF-8 encoded. I ended up writing my own parser, but honestly I don't know enough to know if what I did fits into the larger scheme of how to work with IRIs.
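For reference, a minimal sketch of pulling that netloc apart with the Python 3 standard library. It assumes the built-in "idna" codec (which implements the older IDNA 2003 rules) is good enough for the hostname, and it says nothing about how the user part should be decoded.

```python
from urllib.parse import urlsplit

parts = urlsplit("http://user@xn--bcher-kva.ch/")
print(parts.username)  # user
print(parts.hostname)  # xn--bcher-kva.ch

# Decode the ASCII-compatible form back to Unicode with the "idna" codec.
unicode_host = parts.hostname.encode("ascii").decode("idna")
print(unicode_host)  # bücher.ch
```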

Robert Brewer 10:42 AM on 14 Apr 2010

Well, first, the definition of the question mark is part of the URI spec, but the use of ampersands is part of the HTML spec. There are plenty of syntaxes for the query component of a URI that don't use ampersands at all; for example, the "server-side image map" style which provides a comma-separated pair of integer coordinates: "http://foo/bar?100,385". Heck, you could pass percent-encoded JSON in the query string if you wanted. You are certainly free to use URLs with a scheme like "http://foo/bar?a=3?b=4?c=5"; only HTML forms are probably going to hold you back.
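Both of those query styles can be handled once the query component is treated as an opaque string. A small sketch, assuming Python 3 and `urllib.parse.urlsplit`; the parsing of each style is hand-rolled for illustration:

```python
from urllib.parse import urlsplit

# Server-side image map style: the query is just "x,y".
query = urlsplit("http://foo/bar?100,385").query
x, y = (int(part) for part in query.split(","))
print(x, y)  # 100 385

# Only the first "?" ends the path; later ones belong to the query.
odd = urlsplit("http://foo/bar?a=3?b=4?c=5").query
pairs = dict(pair.split("=", 1) for pair in odd.split("?"))
print(pairs)  # {'a': '3', 'b': '4', 'c': '5'}
```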

So you should really rename your Url class to something like DjangoHTMLFormsUrl, since it only handles that one query syntax. Unless you want to hide that information because you want to be stuck with HTML only and forever ;) There are other media types in town; I'll introduce you if you buy the drinks.

Noah Kantrowitz 11:45 AM on 14 Apr 2010

Isn't the new recommendation to use ; instead of & anyway? The choice of & as a separator was pretty silly given that it has to be escaped in HTML.

Andrew Dalke 11:45 AM on 14 Apr 2010

For that matter, quoting from http://www.w3.org/TR/1999/REC-html401-19991224/appendix/notes.html#h-B.2.2 :

We recommend that HTTP server implementors, and in particular, CGI implementors support the use of ";" in place of "&" to save authors the trouble of escaping "&" characters in this manner.
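A sketch of what semicolon-separated queries look like in Python 3. It assumes a Python new enough for `parse_qsl` to accept the `separator` parameter (3.10, or a patched earlier release), and the sample arguments are made up.

```python
from urllib.parse import parse_qsl, quote_plus

args = [("a", "1"), ("b", "two words")]

# urlencode() always joins with "&", so join the quoted pairs by hand.
query = ";".join(f"{quote_plus(k)}={quote_plus(v)}" for k, v in args)
print(query)  # a=1;b=two+words

# Tell the parser which separator to split on.
print(parse_qsl(query, separator=";"))  # [('a', '1'), ('b', 'two words')]
```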

Jason Chu 3:32 PM on 14 Apr 2010

And I thought I was just being silly by calling a method that collects query parameters in a list or dict and outputs a query string: url = base_url + '?' + querystring; If no query string, no ?. All major javascript libraries have them. Python has one too in urllib.urlencode.

Calvin Spealman 10:10 AM on 17 Apr 2010

The return of urlparse() is already a structured object, but I wish it was mutable. That you have to wrap it just to provide mutability is very silly.
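Because that result is a named tuple, its `_replace()` method gives a copy with some fields changed, which covers many cases without a wrapper. A minimal sketch, assuming Python 3 and made-up hosts:

```python
from urllib.parse import urlparse

parts = urlparse("http://example.com/path?a=1#top")

# _replace() returns a new ParseResult; the original is untouched.
moved = parts._replace(netloc="secure.example.com")
print(moved.geturl())  # http://secure.example.com/path?a=1#top
print(parts.netloc)    # example.com
```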

Aron Griffis 9:53 PM on 22 Feb 2011

Hi Ned, I recently found a small problem in this class. I think you want to pass keep_blank_values=True to cgi.parse_qsl (or urlparse.parse_qsl as we're using now). Otherwise http://foo.com/?bar= becomes http://foo.com/ when reassembled. It's also possible that you want parse_qs rather than parse_qsl but I haven't dug into that to know for sure.
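The difference is easy to see in isolation. A short sketch, assuming Python 3, where these functions live in `urllib.parse`:

```python
from urllib.parse import parse_qs, parse_qsl, urlencode

query = "bar=&baz=1"

# By default, parameters with blank values are dropped on parsing.
print(urlencode(parse_qsl(query)))                          # baz=1
print(urlencode(parse_qsl(query, keep_blank_values=True)))  # bar=&baz=1

# parse_qsl plus dict() keeps only the last value of a repeated key;
# parse_qs keeps them all, as lists.
print(dict(parse_qsl("a=1&a=2")))  # {'a': '2'}
print(parse_qs("a=1&a=2"))         # {'a': ['1', '2']}
```

If `args` held the lists from `parse_qs`, the query would have to be rebuilt with `urlencode(args, doseq=True)`.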
