What's the most efficient (or most Pythonic) way to generate all unique matches of a regex? Right now I'm just applying set() after using findall, but I wasn't sure if there's a better way.
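For reference, this is roughly what I mean (the pattern and text are just placeholders):

import re

text = "cat hat cat bat hat"   # placeholder input
pattern = r"\w+at"             # placeholder pattern

# findall returns every match, duplicates included,
# so wrapping it in set() keeps only the unique ones
unique_matches = set(re.findall(pattern, text))
print(unique_matches)          # e.g. {'cat', 'hat', 'bat'} (set order is arbitrary)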
Upvotes: 4
Views: 3398
Other ways of searching text with a regex let you interact with each match as you obtain it, so you can compare it against the matches you already have and eliminate duplicates along the way. If you keep those matches in a list, the worst case is that you compare every match against all the others and never find a duplicate. You can speed this up, of course, by putting the matches into a set as you find them: thanks to the way set is implemented, every lookup and insertion runs in O(1) on average. That is essentially the best run time any operation can have, and it holds regardless of the size of the set.
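A sketch of that idea, using re.finditer so each match can be added to the set the moment it is found (pattern and text are placeholders):

import re

text = "cat hat cat bat hat"   # placeholder input
pattern = r"\w+at"             # placeholder pattern

seen = set()
for match in re.finditer(pattern, text):
    # membership checks and insertions on a set are O(1) on average,
    # so duplicates are discarded as soon as they show up
    seen.add(match.group())

print(seen)                    # e.g. {'cat', 'hat', 'bat'}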
So if you found one match at a time and added it to the set, each insertion would be a constant-time operation, which totals O(n) for n found items. What that doesn't account for is the overhead of the interpreter managing the loop, the function calls, the argument passing and so on. The re module in Python is implemented in C, which makes it far faster at this kind of batch work. There are whole packages built around moving loop-heavy operations into C to gain a few orders of magnitude in speed; numpy is one example. If you want to see just how big this difference is, watch this video from PyCon 2015.
I'm pretty sure, though I haven't tested it, that you can't match the speed at which findall extracts all the matches for a regex. Since it runs in C, with no Python code bogging the process down, it is almost certainly the fastest way of getting the results out of a regex.
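If you wanted to test that yourself, a rough comparison with timeit might look like this (the pattern, text and repeat count are arbitrary):

import re
import timeit

text = "cat hat cat bat hat " * 1000   # arbitrary test input
pattern = re.compile(r"\w+at")         # arbitrary test pattern

def with_findall():
    # one C-level pass over the text, deduplicated afterwards
    return set(pattern.findall(text))

def with_python_loop():
    # pure-Python loop over match objects, deduplicated as we go
    seen = set()
    for m in pattern.finditer(text):
        seen.add(m.group())
    return seen

print("findall + set:", timeit.timeit(with_findall, number=100))
print("finditer loop:", timeit.timeit(with_python_loop, number=100))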
Since you have no way to interact with the matches before findall returns its list, you are left with the standard ways of eliminating duplicates from a list in Python. That is explained well, with examples, in this post:
The common approach to get a unique collection of items is to use a set. Sets are unordered collections of distinct objects. To create a set from any iterable, you can simply pass it to the built-in set() function. If you later need a real list again, you can similarly pass the set to the list() function.
The following example should cover whatever you are trying to do:
>>> t = [1, 2, 3, 1, 2, 5, 6, 7, 8]
>>> t
[1, 2, 3, 1, 2, 5, 6, 7, 8]
>>> list(set(t))
[1, 2, 3, 5, 6, 7, 8]
>>> s = [1, 2, 3]
>>> list(set(t) - set(s))
[8, 5, 6, 7]
I've already gone over how good set is at spotting duplicates, and set() does that lookup work at the same speed as inserting the items yourself in a loop, minus the loop overhead. So by combining these two steps, getting all the matches with findall and eliminating duplicates with set(), you are already faster than anything you could replicate in pure Python code.
Unless there is a compiled module that does both of these operations in one combined step, I doubt you can beat the speed at which findall and set are already working.
Upvotes: 3