Reputation: 57
I would like to scan an 8 GB text file (a log file) for specific words. These words are stored in a dataframe with over 3400 rows.
I've tried the approach below, which avoids loading the entire file into memory:
with open(filename) as f:
    for line in f:
        do_stuff(line)
However, this takes a very long time: scanning the entire file for a single word takes over 2 minutes, so repeating it for all 3400 words would take over 113 hours to complete.
Is there any way to improve this process?
Upvotes: 0
Views: 47
Reputation: 51998
Create a set of the words: words = set(column_of_words)
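If the words sit in a pandas DataFrame, building that set could look like the sketch below (the file name and column name are placeholders, just to illustrate):

import pandas as pd

df = pd.read_csv("words.csv")  # hypothetical source of the ~3400 search words
words = set(df["word"])        # set gives O(1) average-case membership tests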
Then do something like:
with open(filename) as f:
    for line in f:
        words_in_line = set(line.split())
        matches = words & words_in_line  # the set intersection
        if matches:
            do_something(matches)  # placeholder for whatever you do with the matches
Whatever you do -- don't scan the same file 3400 times. Find a way to scan it just once.
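Putting it together, a single pass that records which lines contain each word might look like this sketch (the file names, column name, and defaultdict grouping are assumptions for illustration, not part of your setup):

import pandas as pd
from collections import defaultdict

df = pd.read_csv("words.csv")  # hypothetical word list
words = set(df["word"])

hits = defaultdict(list)       # word -> line numbers where it appears
with open("big.log") as f:     # placeholder log file name
    for lineno, line in enumerate(f, 1):
        for match in words & set(line.split()):
            hits[match].append(lineno)

This reads the 8 GB file exactly once, and the per-line cost is dominated by the split rather than by the number of search words.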
Upvotes: 2