Reputation: 31
I have some pretty large text files (>2g) that I would like to process word by word. The files are space-delimited text files with no line breaks (all words are in a single line). I want to take each word, test if it is a dictionary word (using enchant), and if so, write it to a new file.
This is my code right now:
with open('big_file_of_words', 'r') as in_file:
    with open('output_file', 'w') as out_file:
        words = in_file.read().split(' ')
        for word in words:
            if d.check(word):
                out_file.write("%s " % word)
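Here d is a PyEnchant dictionary; a minimal setup, assuming the en_US word list is installed, would be:
import enchant

# Assumes the en_US dictionary is installed for PyEnchant
d = enchant.Dict("en_US")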
I looked at lazy method for reading big file in python, which suggests using yield to read in chunks, but I am concerned that chunks of a predetermined size will split words in the middle. Basically, I want chunks to be as close to the specified size as possible while splitting only on spaces. Any suggestions?
Upvotes: 3
Views: 2999
Reputation: 42748
Combine the last word of one chunk with the first of the next:
def read_words(filename):
    last = ""
    with open(filename) as inp:
        while True:
            buf = inp.read(10240)
            if not buf:
                break
            # split() leaves any partial trailing word as the last element;
            # pop it off and carry it over into the next chunk
            words = (last + buf).split()
            last = words.pop()
            for word in words:
                yield word
        yield last
with open('output.txt', 'w') as output:
    for word in read_words('input.txt'):
        if d.check(word):
            output.write("%s " % word)
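A quick smoke test of the generator (a hypothetical tiny.txt, just for illustration):
# Write a small space-delimited file and read it back
with open('tiny.txt', 'w') as f:
    f.write('alpha beta gamma')

print(list(read_words('tiny.txt')))  # ['alpha', 'beta', 'gamma']
The carry-over trick works because a chunk boundary can only cut the final word, and that word is exactly what pop() removes and prepends to the next chunk.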
Upvotes: 6
Reputation: 15160
Fortunately, Petr Viktorin has already written code for this. The following generator, words, reads a chunk from a file, then yields each contained word. A word that spans two chunks is handled as well.
def words(input_file):
    line = ''
    while True:
        word, space, line = line.partition(' ')
        if space:
            # A word was found
            yield word
        else:
            # A word was not found; read a chunk of data from the file
            next_chunk = input_file.read(1000)
            if next_chunk:
                # Add the chunk to our line
                line = word + next_chunk
            else:
                # No more data; yield the last word and return
                yield word.rstrip('\n')
                return
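Driving it might look like this (a sketch; words is the generator above, d is the enchant dictionary from the question, and the empty-string guard covers runs of consecutive spaces, which partition can emit as empty words):
with open('big_file_of_words') as in_file, open('output_file', 'w') as out_file:
    for word in words(in_file):
        # Skip empty strings produced by repeated spaces
        if word and d.check(word):
            out_file.write("%s " % word)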
https://stackoverflow.com/a/7745406/143880
Upvotes: 0
Reputation: 142106
You might be able to get away with something similar to an answer on the question you've linked to, but combining re and mmap, e.g.:
import mmap
import re

with open('big_file_of_words', 'rb') as in_file, open('output_file', 'w') as out_file:
    mf = mmap.mmap(in_file.fileno(), 0, access=mmap.ACCESS_READ)
    # mmap exposes bytes, so use a bytes pattern; finditer yields match objects
    for match in re.finditer(rb'\w+', mf):
        word = match.group().decode()
        # do something
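Completing the loop with the dictionary check from the question might look like this (a sketch; it assumes the file is UTF-8 and reuses the enchant dictionary d):
import mmap
import re

with open('big_file_of_words', 'rb') as in_file, open('output_file', 'w') as out_file:
    mf = mmap.mmap(in_file.fileno(), 0, access=mmap.ACCESS_READ)
    for match in re.finditer(rb'\w+', mf):
        word = match.group().decode('utf-8')
        if d.check(word):
            out_file.write("%s " % word)
The appeal of mmap here is that the regex engine scans the file through the OS page cache instead of reading the whole 2 GB into a Python string at once. Note that \w+ also splits on punctuation, unlike the space-splitting approaches above.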
Upvotes: 1