Reputation:
I want to read each line of a text file in Python (around 1 billion lines), take some words from each line, and insert them into another file. I have used
with open('') as f:
    for line in f:
        process_line(line)
This process is taking a lot of time. How can I process the whole file in about 2 hours?
Upvotes: 2
Views: 3632
Reputation: 741
Read about generators in Python. Your code should look like this:
def read_file(yours_file):
    while True:
        data = yours_file.readline()
        if not data:
            break
        yield data
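For reference, a minimal usage sketch of that generator (process_line and 'input.txt' are placeholders, not part of the answer):

# Hypothetical usage of the read_file generator defined above.
def process_line(line):
    # Placeholder for whatever per-line word extraction you do.
    pass

with open('input.txt') as f:
    for line in read_file(f):
        process_line(line)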
Upvotes: 1
Reputation: 107124
The performance bottleneck of your script likely comes from writing to 3 files at the same time, which causes massive fragmentation between the files and hence lots of overhead.
So instead of writing to 3 files at the same time as you read through the lines, you can buffer a million lines (which should take less than 1 GB of memory) and then write the 3 million words to the output files one file at a time, which produces much less file fragmentation:
def write_words(words, *files):
    # Write the i-th word of every buffered line to the i-th output file.
    for i, file in enumerate(files):
        for word in words:
            file.write(word[i] + '\n')

words = []
with open('input.txt', 'r') as f, open('words1.txt', 'w') as out1, open('words2.txt', 'w') as out2, open('words3.txt', 'w') as out3:
    for count, line in enumerate(f, 1):
        words.append(line.rstrip().split(','))
        if count % 1000000 == 0:
            # Flush the buffer to the output files once per million lines.
            write_words(words, out1, out2, out3)
            words = []
    # Write whatever is left after the last full batch.
    write_words(words, out1, out2, out3)
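Note that the final write_words call sits inside the with block but after the loop, so it flushes whatever lines remain after the last full batch of 1,000,000 while the output files are still open.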
Upvotes: 3