Reputation: 17829

regex with multiple lists

so I have multiple lists of various lengths. I am trying to search a corpus for strings containing any combination of items from these lists using regular expressions. I have no clue how to do this and have been searching for a while to discover a solution. If somebody could enlighten me on a technique, or point me to some documentation, that will be an acceptable answer as well.

EXAMPLE

l1 = ["two","three",...,"twenty"]
l2 = ["cats","dogs","fish",...]
my_corpus = """
    I have three cats and two dogs, but I have no fish. 
    I don't care too much about being correct so matching
    the phrase dogs four is acceptable as well.
"""

Ideally I would like to match "three cats","two dogs", and "dogs four" here.

It is important to note that the number of actual lists varies and that order really doesn't matter. But there are certainly too many possible permutations to hardcode this. If I were to even write a script that would create the regular expression with a string of OR, that string would end up being way too large (probably over 1 million characters).

EDIT NOTES

This must be possible with arbitrary numbers of lists. As mentioned, hard coding is not possible here because of the size of these lists. In reality I am using 6 lists, of size 2,10,10,100,1000,10000 respectively.
This must be part of a regular expression. This is a subproblem that is part of a complicated parsing attempt that has already been completed in regex. Let the preexisting regex be r'abc' I would like to be able to turn that into r'a(?:b|solution)c' where a,b, and c are already completed regex.
- It must match any arbitrary number of these words in a row

Let me know if there are any more questions.

Upvotes: 2

Answers (2)

tobias_k

Reputation: 82899

You can create two big disjunctions from the lists, joining them with |, and use those in another disjunction in the regular expression.

p = r"({0}) ({1})|({1}) ({0})".format("|".join(l1), "|".join(l2))
print repr(p)
for m in re.finditer(p, my_corpus):
    print m.span(), repr(m.group())

Output (first the final regex, then the matches)

'(two|three|four) (cats|dogs|fish)|(cats|dogs|fish) (two|three|four)'
(12, 22) 'three cats'
(27, 35) 'two dogs'
(131, 140) 'dogs four'

I did some performance measurements (using just your lists and text, so this may not be representative for longer lists and texts), and it seems like the regex-approach is still about 8 times faster (~10µs) than the other solution matching all words and comparing them to the sets (~85µs), according to IPython's %timeit.

Upvotes: 0

yurib

Reputation: 8147

does it have to be regex?

why not just iterate over words and check if any N adjacent ones are from the lists?

import re

l1 = ["two","three","four","twenty"]
l2 = ["cats","dogs","fish"]

my_corpus = """
    I have three cats and two dogs, but I have no fish.
    I don't care too much about being correct so matching
    the phrase dogs four is acceptable as well.
"""
non_alpha = re.compile('[^a-zA-Z]')
words = [non_alpha.sub('',w) for w in my_corpus.split()]

sets = map(set,[l1,l2])
N = len(sets)
for i in range(len(words)-N):
    ws = words[i:i+N]

    # each word is in at least one set
    all_words = all(any(w in s for s in sets) for w in ws)

    # each set has at least one word
    all_sets = all(any(w in s for w in ws) for s in sets)

    # conditions broken into two for readability, but maybe should be used
    # inline for short circuiting, only important if number of sets is large
    if  all_words and all_sets:
        print ' '.join(ws)

output:

three cats
two dogs
dogs four

also, as commented by @tobias_k, using sets instead of lists will speed things up

Upvotes: 3

regex with multiple lists

EXAMPLE

EDIT NOTES

Answers (2)

Related Questions