Reputation: 1267
There's a text file (about 300 MB) and I need to count the top N most frequent words. The first step is to read it from disk; right now I simply use open().read().lower()
(case insensitive). Is there a more efficient way to handle the IO part? The test machine has 8 cores, 4 GB of memory and runs Linux; the Python version is 2.6.
Upvotes: 1
Views: 127
Reputation: 1121296
Yes, process the file line by line in an iterator:
with open(filename) as inputfile:
    for line in inputfile:
        line = line.lower()
This uses a buffer for read performance but does not put as much pressure on your memory, avoiding having to swap.
Next, use collections.Counter() to do the frequency counting for you. It'll handle the counting and selecting the top N words for you, in the most efficient manner available in pure Python code.
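A quick toy example of how Counter and most_common() behave (the sample list of words here is purely illustrative):
from collections import Counter

words = ['spam', 'egg', 'spam', 'ham', 'spam', 'egg']
counts = Counter(words)       # Counter({'spam': 3, 'egg': 2, 'ham': 1})
print counts.most_common(2)   # prints [('spam', 3), ('egg', 2)]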
A naive way to get words would be to split the lines on whitespace; combining that with a generator expression could give you all the word counts in one line of code:
from collections import Counter

with open(filename) as inputfile:
    counts = Counter(word for line in inputfile for word in line.lower().split())

for word, frequency in counts.most_common(N):
    print '{:<40} {}'.format(word, frequency)
The Counter class was added in Python 2.7; for 2.6 you can use this backport.
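If installing the backport isn't an option, here is a rough sketch of the same idea using only the Python 2.6 standard library: a plain dict for the counting plus heapq.nlargest() to pick the top N entries (filename and N are placeholders for your own values):
import heapq

counts = {}
with open(filename) as inputfile:
    for line in inputfile:
        for word in line.lower().split():
            # dict.get() avoids a KeyError for words seen for the first time
            counts[word] = counts.get(word, 0) + 1

# heapq.nlargest() picks the N highest-count pairs without sorting the whole dict
for word, frequency in heapq.nlargest(N, counts.iteritems(), key=lambda item: item[1]):
    print '{0:<40} {1}'.format(word, frequency)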
Upvotes: 4