Kenneth De Coster

Reputation: 57

Python - Best way to find data in a very large text file (8GB)

I would like to scan an 8GB text file (it's a log file) to find specific words. These words are stored in a dataframe with over 3400 rows.

I've tried the solution below, which avoids having to load the entire document into memory:

with open(filename) as f:
    for line in f:
        do_stuff(line)

However, this is taking a very long time to process. It takes over 2 minutes to scan the entire document for one word, and 2 minutes × 3400 words comes to about 113 hours for the whole script.

Is there any way to improve this process?

Upvotes: 0

Views: 47

Answers (1)

John Coleman

Reputation: 51998

Create a set of the words: words = set(column_of_words)
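
If the words are in a pandas dataframe, the set can be built directly from the relevant column. A minimal sketch, assuming the dataframe is named df and the column is named "word" (neither name appears in the question):

import pandas as pd

df = pd.read_csv("words.csv")        # hypothetical source of the 3400 words
words = set(df["word"].astype(str))  # set membership tests are O(1) on average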

Then do something like:

with open(filename) as f:
    for line in f:
        words_in_line = set(line.split())
        matches = words & words_in_line  # the intersection
        if matches:
            do_stuff(line)  # do something with the matching words
Whatever you do, don't scan the same file 3400 times. Find a way to scan it just once.
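
Putting it together, a single pass over the log could also tally how often each target word appears. This is only a sketch under assumptions: "words.csv", the "word" column, and "big.log" are placeholder names, not anything taken from the question.

from collections import Counter

import pandas as pd

words = set(pd.read_csv("words.csv")["word"].astype(str))  # hypothetical word list

counts = Counter()
with open("big.log") as f:           # placeholder path for the 8GB log
    for line in f:
        matches = words & set(line.split())
        counts.update(matches)       # increment the count of each word found

for word, n in counts.most_common():
    print(word, n)

Reading line by line keeps memory use flat regardless of the file size, and each line costs one set intersection instead of 3400 separate searches.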

Upvotes: 2
