Alex Eftimiades

Reputation: 2617

Efficient way to read/write/parse large text files (python)

Say I have an absurdly large text file. I don't expect my file to grow larger than ~500 MB, but for the sake of scalability and my own curiosity, let's say it is on the order of a few gigabytes.

My end goal is to map it to an array of sentences (separated by '?' '!' '.' and for all intents and purposes ';') and each sentence to an array of words. I was then going to use numpy for some statistical analysis.

What would be the most scalable way to go about doing this?

PS: I thought of rewriting the file to have one sentence per line, but I ran into problems trying to load the file into memory. I know of the solution where you read chunks of data from one file, manipulate them, and write them to another, but that seems inefficient in terms of disk space. I know most people wouldn't worry about using 10 GB of scratch space nowadays, but it does seem like there ought to be a way of directly editing chunks of the file.
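For reference, the chunk-by-chunk rewrite I had in mind is roughly this (only a sketch; the file names and chunk size are placeholders):

import re

# Split just after a sentence terminator, consuming the following whitespace.
terminator = re.compile(r'(?<=[.!?;])\s+')

with open('big.txt') as src, open('big_by_sentence.txt', 'w') as dst:
    leftover = ''
    while True:
        chunk = src.read(1 << 20)      # ~1 MB of text at a time
        if not chunk:
            if leftover:
                dst.write(leftover + '\n')
            break
        pieces = terminator.split(leftover + chunk)
        leftover = pieces.pop()        # possibly incomplete last sentence
        for sentence in pieces:
            dst.write(sentence + '\n')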

Upvotes: 1

Views: 2223

Answers (1)

David Z

Reputation: 131550

My first thought would be to use a stream parser: basically you read in the file a piece at a time and do the statistical analysis as you go. This is typically done with markup languages like HTML and XML, so you'll find a lot of parsers for those languages out there, including in the Python standard library. A simple sentence parser is something you can write yourself, though; for example:

import re, collections

# A sentence ends at '.', '!', '?', or ';', followed by optional whitespace.
sentence_terminator = re.compile(r'(?<=[.!?;])\s*')

class SentenceParser(object):
    def __init__(self, filelike):
        self.f = filelike
        # The rightmost element of the buffer is always a partial sentence.
        self.buffer = collections.deque([''])
    def next(self):
        # Read until the buffer holds at least one complete sentence.
        while len(self.buffer) < 2:
            data = self.f.read(512)
            if not data:
                # End of file; a trailing partial sentence is discarded.
                raise StopIteration()
            # Re-split the partial sentence together with the new data.
            self.buffer += sentence_terminator.split(self.buffer.pop() + data)
        return self.buffer.popleft()
    __next__ = next  # Python 3 iterator protocol
    def __iter__(self):
        return self

This will only read data from the file as needed to complete a sentence. It reads in 512-byte blocks so you'll be holding less than a kilobyte of file contents in memory at any one time, no matter how large the actual file is.
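For instance, a minimal usage sketch (the file name is a placeholder, and str.split is standing in for whatever tokenization you prefer) could feed per-sentence word counts straight into numpy:

import numpy as np

with open('big.txt') as f:
    # One pass over the file: stream sentences, count the words in each.
    word_counts = np.array([len(sentence.split())
                            for sentence in SentenceParser(f)])
print(word_counts.mean(), word_counts.std())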

After a stream parser, my second thought would be to memory map the file. That way you could go through and replace the space that (presumably) follows each sentence terminator by a newline; after that, each sentence would start on a new line, and you'd be able to open the file and use readline() or a for loop to go through it line by line. But you'd still have to worry about multi-line sentences; plus, if any sentence terminator is not followed by a whitespace character, you would have to insert a newline (instead of replacing something else with it) and that could be horribly inefficient for a large file.
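A minimal sketch of that second idea, assuming the text is plain ASCII/UTF-8 and that a single space really does follow each terminator (so the replacement is byte-for-byte and the file never needs to grow):

import mmap, re

with open('big.txt', 'r+b') as f:      # file name is a placeholder
    mm = mmap.mmap(f.fileno(), 0)
    # Overwrite the space after each terminator with a newline, in place.
    for match in re.finditer(rb'[.!?;] ', mm):
        mm[match.end() - 1:match.end()] = b'\n'
    mm.flush()
    mm.close()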

Upvotes: 5
