Saturday 8 September 2012
In a Stack Overflow question a few months ago, someone wanted to remove all the matches of a number of regexes from a string. The complication was that the regexes could overlap.
Simply using re.sub() on each pattern in turn wouldn’t work, because the overlapping matches wouldn’t be fully matched once the first patterns were removed from the string.
The solution is to match the regexes and note the locations of the matches, then in a second pass delete all those parts of the string. Here’s an updated version of my answer:
```python
import re

def remove_regexes(text, patterns):
    """Remove all the matches of any pattern in patterns from text."""
    bytes = bytearray(text, "latin-1")  # Python 3 needs an encoding here
    for pat in patterns:
        for m in re.finditer(pat, text):
            start, end = m.span()
            bytes[start:end] = [0] * (end - start)
    new_string = ''.join(chr(c) for c in bytes if c)
    return new_string
```
There are a few rarely-used features of Python at work here. First, I use a bytearray, which is kind of like a mutable string. Like strings, it is a sequence of bytes. Unlike strings, you can change the bytes in place. This is handy for us to mark which portions of the string are being removed.
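A tiny demonstration of that in-place mutability:

```python
# bytearray is a mutable sequence of bytes: unlike str or bytes,
# individual positions can be reassigned in place.
ba = bytearray(b"hello")
ba[0] = ord("H")        # change one byte without building a new object
print(ba.decode("ascii"))   # Hello
```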
I initialize the bytearray to have the same contents as the text string, then for each pattern, I find all the matches for the pattern, and remove them from the bytearray by replacing the matched bytes with zero bytes.
The re.finditer function gives us an iterator over all the matches, producing a match object for each one. Match objects are usually just tested for truth and then examined for the matched string, but they have other methods too. Here I use m.span(), which returns a two-tuple containing the starting and ending indexes of the match, suitable for use as a slice. I unpack them into start and end, and then use those indexes to write zero bytes into my bytearray with slice assignment.
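In isolation, m.span() works like this:

```python
import re

# span() gives (start, end) indexes, half-open like a slice.
m = re.search(r"\d+", "abc123def")
start, end = m.span()
print(start, end)                  # 3 6
print("abc123def"[start:end])      # 123
```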
Because I match against the original unchanged string, the overlapping regexes are not a problem. When all of the patterns have been matched, what’s left in my bytearray are zero bytes where the patterns matched, and real byte values where they didn’t. A generator expression joins all the good bytes back together, producing a string.
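To see the overlap handling in action, here is the function again (with the latin-1 encoding added so it runs on Python 3), compared against naive sequential substitution:

```python
import re

def remove_regexes(text, patterns):
    """Remove all the matches of any pattern in patterns from text."""
    bytes = bytearray(text, "latin-1")
    for pat in patterns:
        for m in re.finditer(pat, text):
            start, end = m.span()
            bytes[start:end] = [0] * (end - start)
    return ''.join(chr(c) for c in bytes if c)

# "bcd" and "cde" overlap, but both are matched against the original text.
print(remove_regexes("abcdef", ["bcd", "cde"]))   # af

# Sequential re.sub gives a different answer: once "bcd" is removed,
# "cde" no longer matches anything.
s = "abcdef"
for pat in ["bcd", "cde"]:
    s = re.sub(pat, "", s)
print(s)   # aef
```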
Nothing earth-shattering here, just a nice showcase of some little-used Python features.
Comments
This trick also needs a strong caveat that it only works for ASCII or latin-1 encoded text.
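Concretely: building the bytearray requires every character to fit in a single byte, so any code point above 255 fails before the matching even starts.

```python
# U+2603 (snowman) is outside latin-1, so the bytearray can't be built.
ok = True
try:
    bytearray("snowman: \u2603", "latin-1")
except UnicodeEncodeError:
    ok = False
print(ok)   # False
```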
A slight change to your solution would be to use a set to store the index of every character matched instead of storing that data in a byte array. Being a set it automatically ignores duplicates added to the set. Getting on a pedantic soap box, this way does not use the content of the byte as an indicator and separates the data from the meta-data.
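A sketch of what that set-based variant might look like (remove_regexes_set is an illustrative name, not code from the comment); since it never encodes the text, it also works for any Unicode string:

```python
import re

def remove_regexes_set(text, patterns):
    """Remove matches by recording every matched index in a set."""
    matched = set()
    for pat in patterns:
        for m in re.finditer(pat, text):
            matched.update(range(*m.span()))   # duplicates are ignored
    return ''.join(c for i, c in enumerate(text) if i not in matched)
```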
Whenever there is a gap between the maximum end of any previous match and the start of the next match, a chunk of clean text is yielded. If you're using Python 3, you could use 'yield from' instead of 'yield' to have it generate the clean text one character at a time.
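The chunk-yielding idea might be sketched like this (clean_chunks is my name for it, not the commenter's):

```python
import re

def clean_chunks(text, patterns):
    """Yield the stretches of text not covered by any match."""
    spans = sorted(m.span() for pat in patterns
                   for m in re.finditer(pat, text))
    max_end = 0
    for start, end in spans:
        if start > max_end:
            yield text[max_end:start]   # clean gap before this match
        max_end = max(max_end, end)
    if max_end < len(text):
        yield text[max_end:]            # clean tail after the last match
```

As the commenter notes, on Python 3 you could write `yield from text[max_end:start]` to produce the clean text one character at a time instead of in chunks.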
This is the simplest solution I came up with. Finally found a use for itertools.compress.
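That pairing might look something like this (my sketch, not the commenter's actual code): a list of booleans marks which characters to keep, and itertools.compress selects them.

```python
import re
from itertools import compress

def remove_regexes_compress(text, patterns):
    """Keep only the characters not covered by any match."""
    keep = [True] * len(text)
    for pat in patterns:
        for m in re.finditer(pat, text):
            start, end = m.span()
            keep[start:end] = [False] * (end - start)
    # compress() yields items of text whose selector in keep is true.
    return ''.join(compress(text, keep))
```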