Human sorting

Tuesday 11 December 2007

Kate Rhodes wrote a post about simplistic vs. useful sorting: Alphabetical != ASCIIbetical (cute name). In it, she points to Dave Koelle’s Alphanum algorithm, which says to split the string to be sorted into numeric and non-numeric chunks, then sort so that the numeric chunks are treated as numbers. This makes “z2” sort after “z100”, for example.

I looked at the code Dave provided (in Java, C++, or Perl), and all of it is much longer than I expected: the C++ is 40 or so lines. A comment in there says it’s easier in a pattern language like Perl, but the Perl is still 20 iterative lines.

In Python:

import re

def tryint(s):
    try:
        return int(s)
    except ValueError:
        return s
    
def alphanum_key(s):
    """ Turn a string into a list of string and number chunks.
        "z23a" -> ["z", 23, "a"]
    """
    return [ tryint(c) for c in re.split('([0-9]+)', s) ]

def sort_nicely(l):
    """ Sort the given list in the way that humans expect.
    """
    l.sort(key=alphanum_key)

A few helpful Python features make this more compact: re.split provides just the function we want for chunking the string, sort takes a key function for computing sort keys from data, the key to sort on can itself be a list, and comparing two lists compares lexicographically among the elements of the list.

Each of these features in and of itself may have only occasional use, but here they conspire to help me write code nearly as expressive as the English description in my first paragraph. And there are enough of those features that they often help make expressive code like that.

Comments

[gravatar]

This is actually why I like Python so much (well, at least one of the reasons :).

In Python I can express my self better, faster, and without all the bureaucracy involved with other languages. That also makes for faster coding.

[gravatar]

Hmm, no time now, but I'd like to try an itertools.groupby() solution to remove the need for the regexp...

- Paddy.

[gravatar]

Had some time. I get the following snippet:


>>> from itertools import groupby
>>> s = 'zx1a23456qwert'
>>> [int("".join(g)) if k else "".join(g) for k,g in groupby(s, lambda x: x in '0123456789')]
['zx', 1, 'a', 23456, 'qwert']
>>>


- Paddy.

[gravatar]

Beautiful! But I'll suggest one minor point ...

Better not to use an exception, since neither of the two branches of tryint are really exceptions, and exception handling is slow.

How about something simpleminded like

if s[0] in ['0','1','2','3','4','5','6','7','8','9']:
return int(s)
else:
return s

or

if ord(s[0]) > 48 and ord(s[0]) < 58:
....

[gravatar]

I agree with Tom Passin on exceptions, but there's an easier & better way:

def tryint(s):
... if s.isdigit():
....... return int(s)
... else:
....... return s

[gravatar]

With Python 2.5+, you don't even need a tryint function. Replace alphanum_key with:

def alphanum_key(s):
... return [int(c) if c.isdigit() else c for c in re.split('([0-9]+)', s)]

[gravatar]
Neil McCallum 4:38 PM on 11 Dec 2007

Very nice.
How about
return [tryfloat(c) for c in re.split('(-?[0-9\.]+)',s)]

[gravatar]

Minor rewrite for python2.5 one can turn it into a single simple function like so:

-----

import re

def sort_nicely( l ):
""" Sort the given list in the way that humans expect.
"""
convert = lambda text: int(text) if text.isdigit() else text
alphanum = lambda key: [ convert(c) for c in re.split('([0-9]+)', key) ]
l.sort( key=alphanum_key )

[gravatar]

Here are some benchmark results from the different python sorting functions.

[gravatar]

I am usually the first person to avoid regexp, Paddy3118, but in this case its a far cleaner solution.

[gravatar]

Hi Calvin,
I don't think the regexp is a cleaner solution, have a look at my longer version that I posted here: http://weblog.masukomi.org/2007/12/10/alphabetical-asciibetical/comments/1199#comment-1199
The guts of which is:

def splitondigits(string):
return [int("".join(chars)) if digits else "".join(chars)
for digits,chars in groupby(string, str.isdigit)]

asciisorted = sorted(unsorted[:])
alphanumsorted = sorted(unsorted[:], key=splitondigits)


I had to read-up on groupby , whereas probably like you, I know the regexp syntax, but this seems a good fit for the groupby function.

[gravatar]

I think many users would expect uppercase and lowercase words to sort together, so I would use .upper() when returning the text version.

Otherwise 'abc' will sort after 'DEF'

[gravatar]

Here's a version that doesn't try to convert the non-numerical chunks:

def alphanum_key(s):
____key = re.split(r"(\d+)", s)
____key[1::2] = map(int, key[1::2])
____return key

[gravatar]

Shouldn't you better use "sorted(l)" instead of "l.sort()"? Normally I wouldn't want a function to have side-effects, unless necessary. "sorted" will also work on any iterable, whereas the "sort" method is only available for lists.

def sort_nicely(l):
____""" Sort the given list in the way that humans expect."""
____return sorted(l, key=alphanum_key)

[gravatar]
Maxwell Terry 12:42 PM on 16 Dec 2007

Just for the hell of it, here's the tersest rendition I could come up with (pulling source from Petter Otten, Christoper Arndt, and Dave St.Germain's post):

def nsort(l): return sorted(l, key=lambda s: [map(int, e[1::2]) for e in re.split(r'(\d+)', s)])

[gravatar]

Maxwell, your version does not work. The e[1::2] is done on the string, not on the list.

Ned, the python version does have another advantage over the (original) Java version: It works. The Java version throws NumberFormatExceptions if the numbers get to big (e.g. x999999999 > x8888888888). I have sent a changed Java version that uses the fact that bigger numbers are usually longer (taking care of trailing zeroes) to Dave Koelle.

[gravatar]

By the way, a one-line version:
def nsort(l) return sorted(l, key=lambda a:zip(re.split("(\\d+)", a)[0::2], map(int, re.split("(\\d+)", a)[1::2])))

[gravatar]

Thanks for this website. I can't believe there is not more info on this on the web - or in the standard python docs. Great one liner. Great Python. Thanks again, this saved me some trouble.

[gravatar]

Actually, I found it helpful to make one change to the one-liner of Andre. Adding .lower() makes it behave more like windows explorer, sorting Tom with tom. Otherwise it sorts all uppercase files before all lowercase files, as in Nancy, Ned, abigail.

def nsort(l) return sorted(l, key=lambda a.lower()):zip(re.split("(\\d+)", a)[0::2], map(int, re.split("(\\d+)", a)[1::2])))

[gravatar]

To really have it be a "human" sort, it'd be nice to correctly skip over leading "A" and "The". "The Bourne Supremacy" comes before "Caddyshack".

[gravatar]

just want everyone to know, in ruby this is all much prettier, and you get to make lambda functions without writing "lambda", which is always groovy. We're coming for you, dirty python snake language.

[gravatar]

Hi Sikanrong,
You've posted 13 months after the original article, making grandiose statements, all without code. I hope your not on the Ruby development team ;-)

- Paddy.

[gravatar]

Has anyone used the code in the main post with a data set that I could see and analyze so I implement it correctly? I'm a noob and am trying to apply a natural/human sort to a set of items in a listbox that can be pretty random.

Add a comment:

Ignore this:
Leave this empty:
Name is required. Either email or web are required. Email won't be displayed and I won't spam you. Your web site won't be indexed by search engines.
Don't put anything here:
Leave this empty:
URLs auto-link and some tags are allowed: <a><b><i><p><br><pre>.