Reputation: 561
I have code that takes two files as input: (1) a dictionary/lexicon and (2) a text file (one sentence per line).
The first part of my code reads the dictionary into tuples, so it outputs something like:
('mthy3lkw', 'weakBelief', 'U')
('mthy3lkm', 'firmBelief', 'B')
('mthy3lh', 'notBelief', 'A')
The second part of the code searches each sentence in the text file for the words in position 0 of those tuples and then prints out the sentence, the matched word and its type.
So, given the sentence mthy3lkw ana mesh 3arif, the desired output is:
["mthy3lkw ana mesh 3arif", 'mthy3lkw', 'weakBelief', 'U'], since the word mthy3lkw is found in the dictionary.
The second part of my code - the matching part - is TOO slow. How do I make it faster?
Here is my code:
import re

findings = []
for sentence in data:  # the sentences file, opened with .readlines()
    for word in tuples:  # tuples like the ones shown above
        p1 = re.compile(r'\b%s\b' % word[0])  # regex for the first element of every tuple
        if p1.findall(sentence) and word[1] == "firmBelief":
            findings.append([sentence, word[0], "firmBelief"])
print findings
Upvotes: 1
Views: 288
Reputation: 54882
Build a dict lookup structure so you can find the correct entry from your tuples quickly. Then restructure your loops: instead of going through your whole lexicon for each sentence and trying to match every entry, go over each word in the sentence and look it up in the dict:
import re

# Create a lookup structure keyed by the word (first element of each tuple)
word_dictionary = dict((entry[0], entry) for entry in tuples)

findings = []
word_re = re.compile(r'\b\S+\b')  # only need to create the regexp once
for sentence in data:
    for word in word_re.findall(sentence):  # check every word in the sentence
        if word in word_dictionary:         # a match was found
            entry = word_dictionary[word]
            findings.append([sentence, word, entry[1], entry[2]])
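For example, with the sample lexicon and the sentence from the question loaded into tuples and data (an illustrative setup, not part of the code above), the loop produces the desired result:

# Hypothetical sample data taken from the question
tuples = [('mthy3lkw', 'weakBelief', 'U'),
          ('mthy3lkm', 'firmBelief', 'B'),
          ('mthy3lh', 'notBelief', 'A')]
data = ["mthy3lkw ana mesh 3arif"]
# After running the loop above, findings is:
# [['mthy3lkw ana mesh 3arif', 'mthy3lkw', 'weakBelief', 'U']]

This replaces each per-sentence scan over the whole lexicon with one dict lookup per word, so the matching work no longer grows with the size of the lexicon.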
Upvotes: 1
Reputation: 798746
Convert your list of tuples into a trie, and use that for searching.
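A minimal sketch of that idea, assuming the same tuples and data variables as in the question; the TrieNode class and the build_trie/lookup helpers are illustrative names, not a library API:

class TrieNode:
    """One node of a character trie; entry is set only on word-final nodes."""
    def __init__(self):
        self.children = {}
        self.entry = None

def build_trie(tuples):
    root = TrieNode()
    for entry in tuples:
        node = root
        for ch in entry[0]:  # walk/create one node per character of the word
            node = node.children.setdefault(ch, TrieNode())
        node.entry = entry   # mark the end of a lexicon word
    return root

def lookup(root, word):
    node = root
    for ch in word:
        node = node.children.get(ch)
        if node is None:
            return None
    return node.entry        # None if word is only a prefix of a lexicon word

root = build_trie(tuples)
findings = []
for sentence in data:
    for word in sentence.split():
        entry = lookup(root, word)
        if entry is not None:
            findings.append([sentence, word, entry[1], entry[2]])

For exact whole-word lookups a plain dict (as in the other answer) is usually simpler and at least as fast in pure Python; a trie mainly pays off if you also need prefix matching.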
Upvotes: 1