Reputation: 2617
Say I have an absurdly large text file. I don't expect my file to grow larger than ~500 MB, but for the sake of scalability and my own curiosity, let's say it is on the order of a few gigabytes.
My end goal is to map it to an array of sentences (separated by '?' '!' '.' and for all intents and purposes ';') and each sentence to an array of words. I was then going to use numpy for some statistical analysis.
What would be the most scalable way to go about doing this?
PS: I thought of rewriting the file to have one sentence per line, but I ran into problems trying to load the file into memory. I know of the solution where you read off chunks of data from one file, manipulate them, and write them to another, but that seems inefficient with disk space. I know, most people would not worry about using 10 GB of scratch space nowadays, but it does seem like there ought to be a way of directly editing chunks of the file.
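For concreteness, this rough sketch is the chunk-and-rewrite approach I had in mind (the chunk size, file names, and "one sentence per line" transformation are just placeholders):

import re

# Read fixed-size chunks, carry the trailing partial sentence over to the
# next chunk, and write one sentence per line to a second file.
terminator = re.compile(r'(?<=[.!?;])\s+')

def rewrite_one_sentence_per_line(src_path, dst_path, chunk_size=1 << 20):
    leftover = ''
    with open(src_path) as src, open(dst_path, 'w') as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            pieces = terminator.split(leftover + chunk)
            leftover = pieces.pop()  # possibly incomplete sentence
            for sentence in pieces:
                dst.write(sentence.replace('\n', ' ') + '\n')
        if leftover.strip():
            dst.write(leftover.replace('\n', ' ') + '\n')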
Upvotes: 1
Views: 2223
Reputation: 131550
My first thought would be to use a stream parser: basically you read in the file a piece at a time and do the statistical analysis as you go. This is typically done with markup languages like HTML and XML, so you'll find a lot of parsers for those languages out there, including in the Python standard library. A simple sentence parser is something you can write yourself, though; for example:
import re, collections

# split after a sentence terminator, consuming any whitespace that follows it
sentence_terminator = re.compile(r'(?<=[.!?;])\s*')

class SentenceParser(object):
    def __init__(self, filelike):
        self.f = filelike
        # the last element of the buffer is always a (possibly empty) partial sentence
        self.buffer = collections.deque([''])
    def next(self):
        # keep reading until the buffer holds at least one complete sentence
        while len(self.buffer) < 2:
            data = self.f.read(512)
            if not data:
                raise StopIteration()
            self.buffer += sentence_terminator.split(self.buffer.pop() + data)
        return self.buffer.popleft()
    __next__ = next  # Python 3 iterator protocol
    def __iter__(self):
        return self
This will only read data from the file as needed to complete a sentence. It reads in 512-byte blocks, so apart from the sentence currently being assembled, you'll be holding less than a kilobyte of file contents in memory at any one time, no matter how large the actual file is.
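For example, here's a hedged usage sketch that does the statistics as it goes (the file name, the word-splitting regex, and the particular statistic are just illustrative, not part of your problem statement):

import re
import numpy as np

word_re = re.compile(r"[A-Za-z']+")  # naive notion of a "word"

lengths = []
with open('huge.txt') as f:          # hypothetical file name
    for sentence in SentenceParser(f):
        # do the per-sentence work as you go instead of keeping every sentence
        lengths.append(len(word_re.findall(sentence)))

lengths = np.array(lengths)
print(lengths.mean(), lengths.std())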
After a stream parser, my second thought would be to memory map the file. That way you could go through and replace the space that (presumably) follows each sentence terminator with a newline; after that, each sentence would start on a new line, and you'd be able to open the file and use readline() or a for loop to go through it line by line. But you'd still have to worry about multi-line sentences; plus, if any sentence terminator is not followed by a whitespace character, you would have to insert a newline (instead of replacing something else with it), and that could be horribly inefficient for a large file.
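A minimal sketch of that in-place edit, assuming the easy case where each terminator is already followed by a single space (so a same-length replacement suffices; the file name is hypothetical):

import mmap
import re

with open('huge.txt', 'r+b') as f:
    mm = mmap.mmap(f.fileno(), 0)
    # find the space after each terminator first, then overwrite in place
    positions = [m.end() - 1 for m in re.finditer(rb'[.!?;] ', mm)]
    for pos in positions:
        mm[pos:pos + 1] = b'\n'     # same-length replacement: space -> newline
    mm.flush()
    mm.close()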
Upvotes: 5