Human sorting

Tuesday 11 December 2007

Kate Rhodes wrote a post about simplistic vs. useful sorting: Alphabetical != ASCIIbetical (cute name). In it, she points to Dave Koelle’s Alphanum algorithm, which says to split the string to be sorted into numeric and non-numeric chunks, then sort so that the numeric chunks are treated as numbers. This makes “z2” sort after “z100”, for example.

I looked at the code Dave provided (in Java, C++, or Perl), and all of it is much longer than I expected: the C++ is 40 or so lines. A comment in there says it’s easier in a pattern language like Perl, but the Perl is still 20 iterative lines.

In Python:

import re

def tryint(s):
        return int(s)
    except ValueError:
        return s
def alphanum_key(s):
    """ Turn a string into a list of string and number chunks.
        "z23a" -> ["z", 23, "a"]
    return [ tryint(c) for c in re.split('([0-9]+)', s) ]

def sort_nicely(l):
    """ Sort the given list in the way that humans expect.

A few helpful Python features make this more compact: re.split provides just the function we want for chunking the string, sort takes a key function for computing sort keys from data, the key to sort on can itself be a list, and comparing two lists compares lexicographically among the elements of the list.

Each of these features in and of itself may have only occasional use, but here they conspire to help me write code nearly as expressive as the English description in my first paragraph. And there are enough of those features that they often help make expressive code like that.


lorg 6:21 AM on 11 Dec 2007

This is actually why I like Python so much (well, at least one of the reasons :).

In Python I can express my self better, faster, and without all the bureaucracy involved with other languages. That also makes for faster coding.

Paddy3118 10:31 AM on 11 Dec 2007

Hmm, no time now, but I'd like to try an itertools.groupby() solution to remove the need for the regexp...

- Paddy.

Paddy3118 1:42 PM on 11 Dec 2007

Had some time. I get the following snippet:

>>> from itertools import groupby
>>> s = 'zx1a23456qwert'
>>> [int("".join(g)) if k else "".join(g) for k,g in groupby(s, lambda x: x in '0123456789')]
['zx', 1, 'a', 23456, 'qwert']

- Paddy.

Tom Passin 3:07 PM on 11 Dec 2007

Beautiful! But I'll suggest one minor point ...

Better not to use an exception, since neither of the two branches of tryint are really exceptions, and exception handling is slow.

How about something simpleminded like

if s[0] in ['0','1','2','3','4','5','6','7','8','9']:
return int(s)
return s


if ord(s[0]) > 48 and ord(s[0]) < 58:

David Goodger 3:21 PM on 11 Dec 2007

I agree with Tom Passin on exceptions, but there's an easier & better way:

def tryint(s):
... if s.isdigit():
....... return int(s)
... else:
....... return s

David Goodger 3:30 PM on 11 Dec 2007

With Python 2.5+, you don't even need a tryint function. Replace alphanum_key with:

def alphanum_key(s):
... return [int(c) if c.isdigit() else c for c in re.split('([0-9]+)', s)]

Neil McCallum 4:38 PM on 11 Dec 2007

Very nice.
How about
return [tryfloat(c) for c in re.split('(-?[0-9\.]+)',s)]

Toothy 6:48 PM on 11 Dec 2007

Minor rewrite for python2.5 one can turn it into a single simple function like so:


import re

def sort_nicely( l ):
""" Sort the given list in the way that humans expect.
convert = lambda text: int(text) if text.isdigit() else text
alphanum = lambda key: [ convert(c) for c in re.split('([0-9]+)', key) ]
l.sort( key=alphanum_key )

Dave St.Germain 9:01 PM on 11 Dec 2007

Here are some benchmark results from the different python sorting functions.

Calvin Spealman 9:37 PM on 11 Dec 2007

I am usually the first person to avoid regexp, Paddy3118, but in this case its a far cleaner solution.

Paddy3118 2:26 AM on 12 Dec 2007

Hi Calvin,
I don't think the regexp is a cleaner solution, have a look at my longer version that I posted here:
The guts of which is:

def splitondigits(string):
return [int("".join(chars)) if digits else "".join(chars)
for digits,chars in groupby(string, str.isdigit)]

asciisorted = sorted(unsorted[:])
alphanumsorted = sorted(unsorted[:], key=splitondigits)

I had to read-up on groupby , whereas probably like you, I know the regexp syntax, but this seems a good fit for the groupby function.

Simon 7:19 AM on 13 Dec 2007

I think many users would expect uppercase and lowercase words to sort together, so I would use .upper() when returning the text version.

Otherwise 'abc' will sort after 'DEF'

Peter Otten 2:29 PM on 14 Dec 2007

Here's a version that doesn't try to convert the non-numerical chunks:

def alphanum_key(s):
____key = re.split(r"(\d+)", s)
____key[1::2] = map(int, key[1::2])
____return key

Christopher Arndt 10:41 AM on 15 Dec 2007

Shouldn't you better use "sorted(l)" instead of "l.sort()"? Normally I wouldn't want a function to have side-effects, unless necessary. "sorted" will also work on any iterable, whereas the "sort" method is only available for lists.

def sort_nicely(l):
____""" Sort the given list in the way that humans expect."""
____return sorted(l, key=alphanum_key)

Maxwell Terry 12:42 PM on 16 Dec 2007

Just for the hell of it, here's the tersest rendition I could come up with (pulling source from Petter Otten, Christoper Arndt, and Dave St.Germain's post):

def nsort(l): return sorted(l, key=lambda s: [map(int, e[1::2]) for e in re.split(r'(\d+)', s)])

Andre Bogus 6:26 AM on 21 Dec 2007

Maxwell, your version does not work. The e[1::2] is done on the string, not on the list.

Ned, the python version does have another advantage over the (original) Java version: It works. The Java version throws NumberFormatExceptions if the numbers get to big (e.g. x999999999 > x8888888888). I have sent a changed Java version that uses the fact that bigger numbers are usually longer (taking care of trailing zeroes) to Dave Koelle.

Andre Bogus 11:18 AM on 6 Jan 2008

By the way, a one-line version:
def nsort(l) return sorted(l, key=lambda a:zip(re.split("(\\d+)", a)[0::2], map(int, re.split("(\\d+)", a)[1::2])))

Py User 3:15 AM on 23 Mar 2008

Thanks for this website. I can't believe there is not more info on this on the web - or in the standard python docs. Great one liner. Great Python. Thanks again, this saved me some trouble.

Py User 3:38 AM on 23 Mar 2008

Actually, I found it helpful to make one change to the one-liner of Andre. Adding .lower() makes it behave more like windows explorer, sorting Tom with tom. Otherwise it sorts all uppercase files before all lowercase files, as in Nancy, Ned, abigail.

def nsort(l) return sorted(l, key=lambda a.lower()):zip(re.split("(\\d+)", a)[0::2], map(int, re.split("(\\d+)", a)[1::2])))

Sean 12:14 PM on 27 May 2008

To really have it be a "human" sort, it'd be nice to correctly skip over leading "A" and "The". "The Bourne Supremacy" comes before "Caddyshack".

sikanrong 11:16 AM on 12 Jan 2009

just want everyone to know, in ruby this is all much prettier, and you get to make lambda functions without writing "lambda", which is always groovy. We're coming for you, dirty python snake language.

Paddy3118 2:16 AM on 13 Jan 2009

Hi Sikanrong,
You've posted 13 months after the original article, making grandiose statements, all without code. I hope your not on the Ruby development team ;-)

- Paddy.

Add a comment:

Ignore this:
Leave this empty:
Name is required. Either email or web are required. Email won't be displayed and I won't spam you. Your web site won't be indexed by search engines.
Don't put anything here:
Leave this empty:
URLs auto-link and some tags are allowed: <a><b><i><p><br><pre>.