Reputation: 17829
so I have multiple lists of various lengths. I am trying to search a corpus for strings containing any combination of items from these lists using regular expressions. I have no clue how to do this and have been searching for a while to discover a solution. If somebody could enlighten me on a technique, or point me to some documentation, that will be an acceptable answer as well.
l1 = ["two","three",...,"twenty"]
l2 = ["cats","dogs","fish",...]
my_corpus = """
I have three cats and two dogs, but I have no fish.
I don't care too much about being correct so matching
the phrase dogs four is acceptable as well.
"""
Ideally I would like to match "three cats"
,"two dogs"
, and "dogs four"
here.
It is important to note that the number of actual lists varies and that order really doesn't matter. But there are certainly too many possible permutations to hardcode this. If I were to even write a script that would create the regular expression with a string of OR
, that string would end up being way too large (probably over 1 million characters).
r'abc'
I would like to be able to turn that into r'a(?:b|solution)c' where a,b, and c are already completed regex.
Let me know if there are any more questions.
Upvotes: 2
Views: 931
Reputation: 82899
You can create two big disjunctions from the lists, joining them with |
, and use those in another disjunction in the regular expression.
p = r"({0}) ({1})|({1}) ({0})".format("|".join(l1), "|".join(l2))
print repr(p)
for m in re.finditer(p, my_corpus):
print m.span(), repr(m.group())
Output (first the final regex, then the matches)
'(two|three|four) (cats|dogs|fish)|(cats|dogs|fish) (two|three|four)'
(12, 22) 'three cats'
(27, 35) 'two dogs'
(131, 140) 'dogs four'
I did some performance measurements (using just your lists and text, so this may not be representative for longer lists and texts), and it seems like the regex-approach is still about 8 times faster (~10µs) than the other solution matching all words and comparing them to the sets (~85µs), according to IPython's %timeit
.
Upvotes: 0
Reputation: 8147
does it have to be regex?
why not just iterate over words and check if any N
adjacent ones are from the lists?
import re
l1 = ["two","three","four","twenty"]
l2 = ["cats","dogs","fish"]
my_corpus = """
I have three cats and two dogs, but I have no fish.
I don't care too much about being correct so matching
the phrase dogs four is acceptable as well.
"""
non_alpha = re.compile('[^a-zA-Z]')
words = [non_alpha.sub('',w) for w in my_corpus.split()]
sets = map(set,[l1,l2])
N = len(sets)
for i in range(len(words)-N):
ws = words[i:i+N]
# each word is in at least one set
all_words = all(any(w in s for s in sets) for w in ws)
# each set has at least one word
all_sets = all(any(w in s for w in ws) for s in sets)
# conditions broken into two for readability, but maybe should be used
# inline for short circuiting, only important if number of sets is large
if all_words and all_sets:
print ' '.join(ws)
output:
three cats
two dogs
dogs four
also, as commented by @tobias_k, using sets instead of lists will speed things up
Upvotes: 3