Elissa

Reputation: 31

Reading a very large file word by word in Python

I have some pretty large text files (>2 GB) that I would like to process word by word. The files are space-delimited with no line breaks (all words are on a single line). I want to take each word, test whether it is a dictionary word (using enchant), and if so, write it to a new file.

This is my code right now:

import enchant

d = enchant.Dict("en_US")  # en_US as an example; use whichever dictionary you need

with open('big_file_of_words', 'r') as in_file:
    with open('output_file', 'w') as out_file:
        words = in_file.read().split(' ')  # reads the entire file into memory
        for word in words:
            if d.check(word):
                out_file.write("%s " % word)

I looked at lazy method for reading big file in python, which suggests using yield to read in chunks, but I am concerned that chunks of a predetermined size will split words in the middle. Basically, I want chunks to be as close to the specified size as possible while splitting only on spaces. Any suggestions?
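Something like this is what I have in mind (a rough sketch only; chunks_on_spaces and the 1 MB chunk_size are placeholders I made up):

def chunks_on_spaces(filename, chunk_size=1024 * 1024):
    with open(filename) as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                return
            # extend the chunk one character at a time until it ends on a
            # space, so no word is ever cut in the middle
            while not chunk.endswith(' '):
                c = f.read(1)
                if not c:
                    break
                chunk += c
            yield chunk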

Upvotes: 3

Views: 2999

Answers (3)

Daniel

Reputation: 42748

Combine the last word of one chunk with the first of the next:

def read_words(filename, chunk_size=10240):
    last = ""
    with open(filename) as inp:
        while True:
            buf = inp.read(chunk_size)
            if not buf:
                break
            words = (last + buf).split()
            if buf[-1].isspace():
                # the chunk ended on a word boundary: every word is complete
                last = ""
            else:
                # the last word may continue into the next chunk; carry it over
                last = words.pop() if words else ""
            for word in words:
                yield word
        if last:
            yield last

with open('output.txt', 'w') as output:
    for word in read_words('input.txt'):
        if d.check(word):  # d is the enchant dictionary from the question
            output.write("%s " % word)
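A quick sanity check (sample.txt is a hypothetical throwaway file):

with open('sample.txt', 'w') as f:
    f.write('alpha beta gamma delta')

print(list(read_words('sample.txt')))
# ['alpha', 'beta', 'gamma', 'delta']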

Upvotes: 6

johntellsall

Reputation: 15160

Fortunately, Petr Viktorin has already written code for us. The following reads a chunk from a file and yields each contained word; words that span chunk boundaries are handled as well.

def readwords(input_file):
    line = ''
    while True:
        word, space, line = line.partition(' ')
        if space:
            # A word was found
            yield word
        else:
            # A word was not found; read a chunk of data from the file
            next_chunk = input_file.read(1000)
            if next_chunk:
                # Add the chunk to our line
                line = word + next_chunk
            else:
                # No more data; yield the last word and return
                yield word.rstrip('\n')
                return

https://stackoverflow.com/a/7745406/143880
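Hooked up to the question's dictionary check (assuming d is the enchant dictionary from the question), usage looks roughly like:

with open('big_file_of_words') as in_file, open('output_file', 'w') as out_file:
    for word in readwords(in_file):
        if d.check(word):
            out_file.write("%s " % word)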

Upvotes: 0

Jon Clements

Reputation: 142106

You might be able to get away with something similar to an answer on the question you've linked to, combining re and mmap, e.g.:

import mmap
import re

with open('big_file_of_words', 'r') as in_file, open('output_file', 'w') as out_file:
    mf = mmap.mmap(in_file.fileno(), 0, access=mmap.ACCESS_READ)
    # on Python 3 an mmap exposes bytes, so the pattern must be bytes too
    for match in re.finditer(br'\w+', mf):
        word = match.group().decode()
        # do something with word, e.g. the question's d.check(word)
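A nice property of this approach is that mmap lets the operating system page the file in on demand, so re.finditer can scan the whole multi-gigabyte file lazily without ever holding it all in memory at once.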

Upvotes: 1
