Reputation: 561
I have code that takes two files as input: (1) a dictionary/lexicon and (2) a text file (one sentence per line).
The first part of my code reads the dictionary into tuples, so it outputs something like:
('mthy3lkw', 'weakBelief', 'U')
('mthy3lkm', 'firmBelief', 'B')
('mthy3lh', 'notBelief', 'A')
The second part of the code searches each sentence in the text file for the words in position 0 of those tuples and then prints out the sentence, the matched word and its type.
So, given the sentence mthy3lkw ana mesh 3arif, the desired output is:
["mthy3lkw ana mesh 3arif", 'mthy3lkw', 'weakBelief', 'U'], since the word mthy3lkw is found in the dictionary.
The second part of my code - the matching part - is TOO slow. How do I make it faster?
Here is my code:
import re

findings = []
for sentence in data:  # the sentences file, opened with .readlines()
    for word in tuples:  # tuples like the ones shown above
        p1 = re.compile(r'\b%s\b' % word[0])  # regex for the first element of every tuple
        if p1.findall(sentence) and word[1] == "firmBelief":
            findings.append([sentence, word[0], "firmBelief"])
print findings
Upvotes: 1
Views: 288
Reputation: 54882
Build a dict lookup structure so you can find the correct entry from your tuples quickly. Then restructure your loops: instead of going through your whole lexicon for each sentence and trying to match every entry, go over each word in the sentence and look it up in the dict:
import re

# Create a lookup structure keyed by the word (first element of each tuple)
word_dictionary = dict((entry[0], entry) for entry in tuples)

findings = []
word_re = re.compile(r'\b\S+\b')  # only need to create the regexp once
for sentence in data:
    for word in word_re.findall(sentence):  # check every word in the sentence
        if word in word_dictionary:         # a match was found
            entry = word_dictionary[word]
            findings.append([sentence, word, entry[1], entry[2]])
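For example, with the sample lexicon and the sentence from the question loaded into tuples and data (an illustrative setup, not part of the code above), the loop produces the desired result:

# Hypothetical sample data taken from the question
tuples = [('mthy3lkw', 'weakBelief', 'U'),
          ('mthy3lkm', 'firmBelief', 'B'),
          ('mthy3lh', 'notBelief', 'A')]
data = ["mthy3lkw ana mesh 3arif"]
# After running the loop above, findings is:
# [['mthy3lkw ana mesh 3arif', 'mthy3lkw', 'weakBelief', 'U']]

This replaces each per-sentence scan over the whole lexicon with one dict lookup per word, so the matching work no longer grows with the size of the lexicon.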
Upvotes: 1
Reputation: 798746
Convert your list of tuples into a trie, and use that for searching.
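A minimal sketch of that idea, assuming the same tuples and data variables as in the question; the TrieNode class and the build_trie/lookup helpers are illustrative names, not a library API:

class TrieNode:
    """One node of a character trie; entry is set only on word-final nodes."""
    def __init__(self):
        self.children = {}
        self.entry = None

def build_trie(tuples):
    root = TrieNode()
    for entry in tuples:
        node = root
        for ch in entry[0]:  # walk/create one node per character of the word
            node = node.children.setdefault(ch, TrieNode())
        node.entry = entry   # mark the end of a lexicon word
    return root

def lookup(root, word):
    node = root
    for ch in word:
        node = node.children.get(ch)
        if node is None:
            return None
    return node.entry        # None if word is only a prefix of a lexicon word

root = build_trie(tuples)
findings = []
for sentence in data:
    for word in sentence.split():
        entry = lookup(root, word)
        if entry is not None:
            findings.append([sentence, word, entry[1], entry[2]])

For exact whole-word lookups a plain dict (as in the other answer) is usually simpler and at least as fast in pure Python; a trie mainly pays off if you also need prefix matching.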
Upvotes: 1