Reputation: 2504
I have a list L of around 40,000 phrases and a document of around 10 million words. I want to find which pairs of these phrases co-occur within a window of 4 words. For example, consider L=["brown fox","lazy dog"] and a document containing "a quick brown fox jumps over the lazy dog". I want to count how many times "brown fox" and "lazy dog" appear within a window of four words of each other and store that count in a file. I have the following code for doing this:
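import re

# L is the list of phrases; "pairs.txt" is an assumed name for the output file
content = open("d.txt", "r").read().replace("\n", " ")
f = open("pairs.txt", "w")
for i in range(len(L)):
    for j in range(i + 1, len(L)):
        # L[i] followed by L[j] with 1 to 4 words in between, and the reverse order
        wr = L[i] + r"\W+(?:\w+\W+){1,4}" + L[j]
        wrev = L[j] + r"\W+(?:\w+\W+){1,4}" + L[i]
        phrasecoccur = len(re.findall(wr, content)) + len(re.findall(wrev, content))
        if phrasecoccur > 0:
            f.write(L[i] + ", " + L[j] + ", " + str(phrasecoccur) + "\n")
f.close()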
Essentially, for each pair of phrases in the list L, I check how many times the two phrases appear within a window of 4 words in the document. However, this scans the whole document twice for every pair, and with 40K phrases there are roughly 800 million pairs, so it is far too slow. Is there a better way of doing this?
Upvotes: 3
Views: 4268
Reputation: 104682
It should be possible to assemble your 40000 phrases into a big regular expression pattern, and use that to match against your document. It might not be as fast as something more job-specific, but it does work. Here's how I'd do it:
import re

class Matcher(object):
    def __init__(self, phrases):
        # One alternation over all phrases; re.escape guards against regex metacharacters
        phrase_pattern = "|".join("(?:{})".format(re.escape(phrase)) for phrase in phrases)
        # Lazily allow up to four intervening words between the two phrases
        gap_pattern = r"\W+(?:\w+\W+){0,4}?"
        full_pattern = "({0}){1}({0})".format(phrase_pattern, gap_pattern)
        self.regex = re.compile(full_pattern)

    def match(self, doc):
        return self.regex.findall(doc)  # or use finditer to generate match objs
Here's how you can use it:
>>> L = ["brown fox", "lazy dog"]
>>> matcher = Matcher(L)
>>> doc = "The quick brown fox jumps over the lazy dog."
>>> matcher.match(doc)
[('brown fox', 'lazy dog')]
This solution does have a few limitations. One is that it won't detect overlapping pairs of phrases. So in the example, if you added the phrase "jumps over" to the phrase list, you would still only get one matched pair, ("brown fox", "jumps over"). It would miss both ("brown fox", "lazy dog") and ("jumps over", "lazy dog"), since they include some of the same words.
Upvotes: 1
Reputation: 133975
You could use something similar to the Aho-Corasick string matching algorithm. Build the state machine from your list of phrases. Then start feeding words into the state machine. Whenever a match occurs, the state machine will tell you which phrase matched and at what word number. So your output would be something like:
"brown fox", 3
"lazy dog", 8
etc.
You can either capture all of the output and post-process it, or you can process the matches as they're found.
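As a rough sketch of the post-processing option, the following counts phrase pairs from a list of matches. It assumes each match is a `(phrase, start, end)` tuple of word indices (end inclusive) sorted by start position; that tuple layout, the function name `count_cooccurrences`, and the `max_gap` parameter are illustrative choices, not part of any particular library.

```python
from collections import Counter, deque

def count_cooccurrences(hits, max_gap=4):
    """Count unordered pairs of distinct phrases with at most max_gap words between them.

    hits: iterable of (phrase, start, end) word indices, sorted by start (assumed format).
    """
    counts = Counter()
    recent = deque()  # earlier hits that may still be close enough to pair up
    for phrase, start, end in hits:
        # The oldest hit can be dropped once it ends too far behind the current start
        while recent and start - recent[0][2] - 1 > max_gap:
            recent.popleft()
        for other, _, other_end in recent:
            gap = start - other_end - 1  # number of words between the two phrases
            if 0 <= gap <= max_gap and other != phrase:
                counts[tuple(sorted((other, phrase)))] += 1
        recent.append((phrase, start, end))
    return counts

hits = [("brown fox", 2, 3), ("jumps over", 4, 5), ("lazy dog", 7, 8)]
for pair, n in sorted(count_cooccurrences(hits).items()):
    print(pair, n)
```

Because only recent matches are kept in the deque, the work per match depends on how many phrases fall inside the window rather than on the size of the phrase list. Overlapping matches (negative gap) are skipped here; whether they should count is a judgment call.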
It takes a little time to build the state machine (a few seconds for 40,000 phrases), but after that it's linear in the number of input tokens, number of phrases, and number of matches.
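To make the idea concrete, here is a minimal word-level Aho-Corasick sketch. It is not the code I used, just an illustration under some assumptions: phrases and the document are tokenized with a plain whitespace `split()`, and the names `build_automaton` and `find_phrases` are made up for this example. Positions are reported as the 0-based index of the phrase's first word.

```python
from collections import deque

def build_automaton(phrases):
    """Build goto/fail/output tables over words (one trie edge per word)."""
    goto, fail, out = [{}], [0], [[]]
    for pid, phrase in enumerate(phrases):
        state = 0
        for word in phrase.split():
            nxt = goto[state].get(word)
            if nxt is None:
                nxt = len(goto)
                goto[state][word] = nxt
                goto.append({})
                fail.append(0)
                out.append([])
            state = nxt
        out[state].append(pid)
    # Breadth-first pass: depth-1 nodes fail to the root, deeper nodes follow
    # their parent's failure chain until a matching edge is found
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for word, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and word not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(word, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out

def find_phrases(words, phrases, automaton):
    """Return (phrase, start_word_index) for every phrase occurrence in words."""
    goto, fail, out = automaton
    lengths = [len(p.split()) for p in phrases]
    state, hits = 0, []
    for pos, word in enumerate(words):
        while state and word not in goto[state]:
            state = fail[state]
        state = goto[state].get(word, 0)
        for pid in out[state]:
            hits.append((phrases[pid], pos - lengths[pid] + 1))
    return hits

phrases = ["brown fox", "lazy dog"]
words = "a quick brown fox jumps over the lazy dog".split()
print(find_phrases(words, phrases, build_automaton(phrases)))
```

Real text would need a tokenizer that strips punctuation and normalizes case before words are fed to the automaton; the `split()` above is only a placeholder for that step.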
I used something similar to match 50 million YouTube video titles against the several million song titles and artist names in the MusicBrainz database. Worked great. And very fast.
Upvotes: 3
Reputation: 7807
Expanding on Joel's answer, your iterator could be something like this:
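def doc_iter(doc):
    # doc is a list of words; yield each 4-word sliding window in turn
    words = doc[0:4]
    yield words
    for i in range(4, len(doc)):  # the first window already covers indices 0-3
        words = words[1:]
        words.append(doc[i])
        yield words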
Put your phrases in a dict and run the iterator over the doc, checking the words in each window against the dict at each iteration. This should give you performance between O(n) and O(n*log(n)).
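A possible way to do the dict lookup, sketched under a few assumptions: the document is already a list of words, phrases are split on whitespace, and the function name `find_phrase_hits` is invented for this example. Instead of a fixed 4-word window it tries every phrase length present in the list at each position, so phrases of any length can be found.

```python
def find_phrase_hits(words, phrases):
    """Yield (phrase, start, end) word indices for every phrase occurrence."""
    # Map each phrase's tuple of words back to the original phrase string
    lookup = {tuple(p.split()): p for p in phrases}
    lengths = sorted({len(key) for key in lookup})
    for start in range(len(words)):
        for n in lengths:
            if start + n > len(words):
                break  # longer phrases cannot fit either
            key = tuple(words[start:start + n])
            if key in lookup:
                yield lookup[key], start, start + n - 1

words = "a quick brown fox jumps over the lazy dog".split()
for hit in find_phrase_hits(words, ["brown fox", "lazy dog"]):
    print(hit)
```

Each position costs one dict lookup per distinct phrase length, so the scan stays close to linear in the document size as long as the phrases are short. The resulting positions can then be compared to decide which phrases fall within four words of each other.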
Upvotes: 0