Ned Batchelder
Removing overlapping regex matches
September 2012
In a Stack Overflow question a few months ago, a petitioner wanted to remove all the matches of a number of regexes. The complication was that the regexes could overlap.
Simply using re.sub() on each pattern in turn wouldn't work: once the matches for the first patterns were removed from the string, later patterns that overlapped them would no longer match in full.
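To make the failure concrete, here is a small sketch of the naive approach. The sample text and the two patterns are made up for illustration; they overlap on the letter "c", and the hoped-for result would be just "f".

```python
import re

# Hypothetical sample data: "abc" and "cde" overlap on the "c".
text = "abcdef"
patterns = ["abc", "cde"]

# Naive approach: substitute each pattern in turn on the evolving string.
result = text
for pat in patterns:
    result = re.sub(pat, "", result)

# After "abc" is removed, "cde" no longer appears, so "de" survives.
print(result)  # def
```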
The solution is to match every regex against the original string, note the locations of the matches, and then in a second pass delete all those parts of the string. Here's an updated version of my answer:
def remove_regexes(text, patterns):
There are a few rarely-used features of Python at work here. First, I use a bytearray, which is kind of like a mutable string. Like a Python 2 string, it is a sequence of bytes. Unlike a string, it lets you change the bytes in place. This is handy for marking which portions of the string are being removed.
I initialize the bytearray to have the same contents as the text string, then for each pattern, I find all the matches for the pattern, and remove them from the bytearray by replacing the matched bytes with zero bytes.
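As a tiny illustration of the in-place marking, here is a sketch written for Python 3, where a bytearray is built from a bytes literal rather than directly from a text string. The sample data and the variable name buf are invented for the example.

```python
# A bytearray is a mutable sequence of bytes.
buf = bytearray(b"abcdef")

# Slice assignment overwrites bytes in place; here "bc" becomes two zero bytes.
buf[1:3] = b"\x00\x00"
print(buf)  # bytearray(b'a\x00\x00def')

# Dropping the zero bytes leaves only the parts that were not marked.
print(bytes(b for b in buf if b))  # b'adef'
```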
The re.finditer function gives us an iterator over all the matches, and produces a match object for each one. Match objects are usually just tested and then examined for the matched string, but they have other methods on them too. Here I use m.span(), which returns a two-tuple containing the starting and ending indexes of the match, suitable for use as a slice. I unpack them into start and end, and then use those indexes to write zero bytes into my bytearray using slice assignment.
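A short sketch of re.finditer and m.span() on their own, using a made-up sample string and pattern, shows how the span doubles as a slice:

```python
import re

# Hypothetical sample text; the pattern finds runs of digits.
text = "one 22 three 4444"

for m in re.finditer(r"\d+", text):
    # span() returns (start, end), with end exclusive, just like a slice.
    start, end = m.span()
    print(start, end, text[start:end])

# Prints:
# 4 6 22
# 13 17 4444
```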
Because I match against the original unchanged string, the overlapping regexes are not a problem. When all of the patterns have been matched, what's left in my bytearray are zero bytes where the patterns matched, and real byte values where they didn't. A list comprehension joins all the good bytes back together, and produces a string.
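Putting the pieces together, here is a minimal sketch of the same idea for Python 3. It is an adaptation, not the original listing: the name remove_regexes_sketch is hypothetical, and because Python 3 strings are Unicode rather than bytes, the bytearray here holds one keep/remove flag per character instead of a copy of the text. It also joins with a generator expression rather than a list comprehension.

```python
import re

def remove_regexes_sketch(text, patterns):
    """Remove every match of any pattern in `patterns` from `text`."""
    # One flag byte per character: 1 means keep, 0 means remove.
    keep = bytearray(b"\x01" * len(text))
    for pat in patterns:
        # Always match against the original, unmodified text.
        for m in re.finditer(pat, text):
            start, end = m.span()
            # Zero the flags for the matched span; bytes(n) is n zero bytes.
            keep[start:end] = bytes(end - start)
    # Rebuild the string from the characters whose flag is still set.
    return "".join(ch for ch, flag in zip(text, keep) if flag)

# Overlapping patterns: both still match because the text is never modified.
print(remove_regexes_sketch("abcdef", ["abc", "cde"]))  # f
```

Keeping flags separate from the text means the match indexes, which count characters, always line up with the flags, even when the text contains non-ASCII characters.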
Nothing earth-shattering here, just a nice showcase of some little-used Python features.