When sorting strings, you’d often like the order to make sense to a person. That means numbers need to be treated numerically even if they are in a larger string.
For example, sorting Python versions with the default sort() would give you:
Python 3.10
Python 3.11
Python 3.9
when you want it to be:
Python 3.9
Python 3.10
Python 3.11
I wrote about this long ago (Human sorting), but have continued to tweak the code and needed to add it to a project recently. Here’s the latest:
import re

def human_key(s: str) -> tuple[list[str | int], str]:
    """Turn a string into a sortable value that works how humans expect.

    "z23A" -> (["z", 23, "a"], "z23A")

    The original string is appended as a last value to ensure the
    key is unique enough so that "x1y" and "x001y" can be distinguished.

    """
    def try_int(s: str) -> str | int:
        """If `s` is a number, return an int, else `s` unchanged."""
        try:
            return int(s)
        except ValueError:
            return s

    return ([try_int(c) for c in re.split(r"(\d+)", s.casefold())], s)

def human_sort(strings: list[str]) -> None:
    """Sort a list of strings how humans expect."""
    strings.sort(key=human_key)
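As a quick check, here is a small usage sketch. The key function is repeated (without the full type hints) only so the snippet runs on its own, and the file names are made up for illustration.

```python
import re

def human_key(s: str):
    """Same key as in the listing above, repeated so this snippet runs alone."""
    def try_int(part: str):
        try:
            return int(part)
        except ValueError:
            return part
    return ([try_int(c) for c in re.split(r"(\d+)", s.casefold())], s)

# The version strings from the start of the post, in default sort() order.
versions = ["Python 3.10", "Python 3.11", "Python 3.9"]
versions.sort(key=human_key)
print(versions)  # ['Python 3.9', 'Python 3.10', 'Python 3.11']

# The same key works with sorted(); these file names are hypothetical.
files = ["img12.png", "IMG2.png", "img1.png"]
print(sorted(files, key=human_key))  # ['img1.png', 'IMG2.png', 'img12.png']
```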
The central idea here is to turn a string like "Python 3.9" into the list ["python ", 3, ".", 9, ""] so that numeric components will be sorted by their numeric value. (The text is lower-cased by casefold(), and re.split() leaves a trailing empty string when the input ends in digits.) The re.split() function gives us interleaved words and numbers, and try_int() turns the numbers into actual numbers, giving us sortable key lists.
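The two steps can be watched separately. This is only an illustrative sketch of what happens inside the key function: the names parts and key_list are mine, not part of the code above. Note that casefold() lower-cases the text, and that re.split() emits a trailing empty string when the input ends in digits.

```python
import re

def try_int(s: str):
    """If `s` is a number, return an int, else `s` unchanged."""
    try:
        return int(s)
    except ValueError:
        return s

# Step 1: split on runs of digits. The capturing group keeps the digits
# in the result, so text and numbers come out interleaved.
parts = re.split(r"(\d+)", "Python 3.9".casefold())
print(parts)     # ['python ', '3', '.', '9', '']

# Step 2: convert the digit runs to ints so they compare numerically.
key_list = [try_int(p) for p in parts]
print(key_list)  # ['python ', 3, '.', 9, '']
```

Because the split always starts with a (possibly empty) text piece and then alternates, two key lists line up position by position: strings are compared with strings and ints with ints.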
There are two improvements from the original:
- The sort is made case-insensitive by using casefold() to lower-case the string.
- The key returned is now a two-element tuple: the first element is the list
of intermixed strings and integers that gives us the ordering we want. The
second element is the original string unchanged to ensure that unique strings
will always result in distinct keys. Without it, "x1y" and "x001Y" would both produce the same key. This solves a problem that actually happened when sorting the items of a dictionary.

# Without the tuple: different strings, same key!!
human_key("x1y") -> ["x", 1, "y"]
human_key("x001Y") -> ["x", 1, "y"]
# With the tuple: different strings, different keys.
human_key("x1y") -> (["x", 1, "y"], "x1y")
human_key("x001Y") -> (["x", 1, "y"], "x001Y")
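One visible consequence of equal keys is that the result depends on the input order, because Python's sort is stable. The sketch below illustrates that; list_only_key is a hypothetical name for the key without the tie-breaking string, and this is not necessarily the exact problem that came up with the dictionary.

```python
import re

def try_int(s: str):
    try:
        return int(s)
    except ValueError:
        return s

def list_only_key(s: str) -> list:
    """Hypothetical name: the key without the original string appended."""
    return [try_int(c) for c in re.split(r"(\d+)", s.casefold())]

def human_key(s: str) -> tuple:
    """The two-element key: the list, then the original string."""
    return (list_only_key(s), s)

# Equal keys: the stable sort keeps whatever order the input had.
a = sorted(["x1y", "x001Y"], key=list_only_key)
b = sorted(["x001Y", "x1y"], key=list_only_key)
print(a, b)  # ['x1y', 'x001Y'] ['x001Y', 'x1y']

# Distinct keys: the original string breaks the tie the same way every time.
c = sorted(["x1y", "x001Y"], key=human_key)
d = sorted(["x001Y", "x1y"], key=human_key)
print(c, d)  # ['x001Y', 'x1y'] ['x001Y', 'x1y']
```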
If you are interested, there are many different ways to split the string into the word/number mix. The comments on the old post have many alternatives, and there are certainly more.
This still makes some assumptions about what is wanted, and doesn’t cover all possible options (floats? negative/positive? full file paths?). For those, you probably want the full-featured natsort (natural sort) package.
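To make two of those assumptions concrete, here is a small sketch of how this key treats decimals and minus signs. The key function is repeated so the snippet runs on its own, and the sample strings are made up for illustration.

```python
import re

def human_key(s: str):
    """Same key as above, repeated so this snippet runs alone."""
    def try_int(part: str):
        try:
            return int(part)
        except ValueError:
            return part
    return ([try_int(c) for c in re.split(r"(\d+)", s.casefold())], s)

# "1.10" is read as the integers 1 and 10, not as the float 1.1.
decimals = ["1.10", "1.5"]
print(sorted(decimals, key=human_key))  # ['1.5', '1.10']
print(sorted(decimals, key=float))      # ['1.10', '1.5']

# A leading "-" is just text, so the numbers sort by magnitude.
negatives = ["-5", "-3"]
print(sorted(negatives, key=human_key))  # ['-3', '-5']
print(sorted(negatives, key=int))        # ['-5', '-3']
```

Reading "1.10" as 1 then 10 is exactly what you want for version numbers, which is why the right behavior depends on what the strings mean.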
Comments
Another lovely piece of Python, Ned, thank you! I’ve been using a variation of your original code for many years (and the excellent natsort too, on bigger projects). It’s a great showcase of Python’s strengths.
Using such human sorting is one of those things that doesn’t make a big splash, but really shows its quality over time, with less ‘unexpected’ behaviours and confused users.