user4694244


Returning unique matches using regex in python

What's the most efficient (or most Pythonic) way to generate all unique matches of a regex? Right now I'm just applying set() after using findall, but I wasn't sure if there's a better way.
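For reference, a minimal sketch of the approach described in the question (the text and pattern here are made up for illustration):

```python
import re

text = "cat dog cat bird dog"

# findall returns every match, including repeats
matches = re.findall(r"\w+", text)

# wrapping the result in set() discards the duplicates
unique = set(matches)
```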

Upvotes: 4

Views: 3398

Answers (1)

melwil

Reputation: 2553

Other ways to find all matches

Other ways of searching text with regex, such as finditer, let you interact with each match as it is found, so you can eliminate duplicates along the way. Done naively, by comparing each new match against a list of previous matches, this is quadratic in the worst case: every match gets compared to all the others without ever finding a duplicate. You can speed this up by putting the matches into a set as you find them; because sets are implemented with hashing, membership tests and insertions run in O(1) time on average. That is essentially the best running time any lookup can have, and it holds regardless of the size of the set.
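A sketch of that incremental approach, using finditer with a set for O(1) membership tests (the text and pattern are invented for the example); this variant also preserves the order in which matches first appear, which set(findall(...)) does not:

```python
import re

text = "spam ham spam eggs ham spam"

seen = set()
unique_in_order = []
for m in re.finditer(r"\w+", text):
    word = m.group(0)
    if word not in seen:  # average O(1) thanks to hashing
        seen.add(word)
        unique_in_order.append(word)

# unique_in_order keeps only the first occurrence of each match
```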

The slow python and fast C

So if you found one match at a time and added it to a set, each insertion would cost O(1), for a total of O(n) across n found items. What that figure does not include is the overhead of running the loop in Python: interpreting bytecode, creating match objects, handling positional arguments, and so on. The re module is implemented in C, which is a lot faster at this kind of batch work. There are packages designed specifically to take loop-heavy operations and speed them up by orders of magnitude by moving the work into C; numpy is one example. If you want to see just how big this difference is, watch this video from PyCon 2015
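You can measure the difference yourself with timeit: the version below compares set(findall(...)), where the matching loop runs in C, against a pure-Python loop over finditer. The text, pattern, and repetition counts are arbitrary choices for the benchmark; on CPython the findall variant is typically faster:

```python
import re
import timeit

text = " ".join(["alpha beta gamma"] * 1000)
pattern = re.compile(r"\w+")

# C-backed: the loop over matches happens inside the re module
t_findall = timeit.timeit(lambda: set(pattern.findall(text)), number=100)

# Pure Python: match extraction is driven from interpreted code
def python_loop():
    seen = set()
    for m in pattern.finditer(text):
        seen.add(m.group(0))
    return seen

t_loop = timeit.timeit(python_loop, number=100)
```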

I'm pretty sure, though I haven't measured it, that a pure-Python loop cannot match the speed at which findall extracts all the matches for a regex. Since its matching loop runs entirely in C, with no Python code bogging the process down, it is almost certainly the fastest way to get the results.

Since findall gives you no way to interact with the matches before it returns the list, you are left with deduplicating the list afterwards. This is explained well, with examples, in this post:

Removing duplicates in lists

The common approach to get a unique collection of items is to use a set. Sets are unordered collections of distinct objects. To create a set from any iterable, you can simply pass it to the built-in set() function. If you later need a real list again, you can similarly pass the set to the list() function.

The following example should cover whatever you are trying to do:

>>> t = [1, 2, 3, 1, 2, 5, 6, 7, 8]
>>> t
[1, 2, 3, 1, 2, 5, 6, 7, 8]
>>> list(set(t))
[1, 2, 3, 5, 6, 7, 8]
>>> s = [1, 2, 3]
>>> list(set(t) - set(s))
[8, 5, 6, 7]

I've already gone over how good set is at spotting duplicates, and it does so at the same speed whether you pass it the whole list at once or add the items one by one in a loop. Which means that combining findall with set() already outperforms anything equivalent you could write in pure Python code.

Conclusion

Unless there is a combined way to do both these operations in a single compiled module, I doubt you could beat the speed at which findall and set already work.
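Putting the two pieces together, the whole operation is a one-liner. If you also need the matches in order of first appearance, an order-preserving alternative is dict.fromkeys, relying on the insertion-ordered dicts of Python 3.7+ (the text and pattern here are illustrative):

```python
import re

text = "to be or not to be"

# Unordered unique matches: findall in C, dedup via set
unique = set(re.findall(r"\w+", text))

# Order-preserving variant (assumes Python 3.7+ dict ordering)
ordered_unique = list(dict.fromkeys(re.findall(r"\w+", text)))
```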

Upvotes: 3
