ctrl-alt-delor

Reputation: 7745

get words from large file, using low memory in python

I need to iterate over the words in a file. The file could be very big (over 1 TB), and the lines could be very long (the whole file might be a single line). The words are English, so they are reasonable in size. I therefore don't want to load the whole file, or even a whole line, into memory.

I have some code that works, but it may explode if lines are too long (longer than ~3GB on my machine).

import re

def words(file):
    for line in file:
        # the whole line is held in memory here, which is the problem
        words = re.split(r"\W+", line)
        for w in words:
            word = w.lower()
            if word != '':
                yield word

Can you tell me how I can, simply, rewrite this iterator function so that it does not hold more than needed in memory?

Upvotes: 2

Views: 874

Answers (1)

Martijn Pieters

Reputation: 1122242

Don't read line by line; read in buffered chunks instead:

import re

def words(file, buffersize=2048):
    buffer = ''
    # keep calling file.read() until it returns '' (end of file)
    for chunk in iter(lambda: file.read(buffersize), ''):
        words = re.split(r"\W+", buffer + chunk)
        buffer = words.pop()  # partial word at end of chunk or empty
        for word in (w.lower() for w in words if w):
            yield word

    # anything left over after the last chunk is the final word
    if buffer:
        yield buffer.lower()

I'm using the callable-and-sentinel version of the iter() function to handle reading from the file until file.read() returns an empty string; I prefer this form over a while loop.
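For comparison, the same read loop written as an explicit while loop might look like the sketch below. The name words_while is made up for this illustration, and it assumes the file is opened in text mode, so that read() returns '' at end of file.

```python
import re
from io import StringIO

def words_while(file, buffersize=2048):
    # Hypothetical name; same logic as words(), with an explicit loop
    # instead of iter(callable, sentinel).
    buffer = ''
    while True:
        chunk = file.read(buffersize)
        if not chunk:  # '' signals end of file in text mode
            break
        parts = re.split(r"\W+", buffer + chunk)
        buffer = parts.pop()  # partial word at end of chunk, or ''
        for w in parts:
            if w:
                yield w.lower()
    if buffer:
        yield buffer.lower()

# A tiny buffer forces words to straddle chunk boundaries.
print(list(words_while(StringIO("Hello, big World"), 4)))
# ['hello', 'big', 'world']
```

The iter() form folds the read, the end-of-file test and the break into a single line, which is why I prefer it.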

If you are using Python 3.3 or newer, you can use generator delegation here:

def words(file, buffersize=2048):
    buffer = ''
    for chunk in iter(lambda: file.read(buffersize), ''):
        words = re.split(r"\W+", buffer + chunk)
        buffer = words.pop()  # partial word at end of chunk or empty
        yield from (w.lower() for w in words if w)

    if buffer:
        yield buffer.lower()

Demo using a small chunk size to demonstrate this all works as expected:

>>> from io import StringIO
>>> demo = StringIO('''\
... Lorem ipsum dolor sit amet, consectetur adipiscing elit. Pellentesque in nulla nec mi laoreet tempus non id nisl. Aliquam dictum justo ut volutpat cursus. Proin dictum nunc eu dictum pulvinar. Vestibulum elementum urna sapien, non commodo felis faucibus id. Curabitur
... ''')
>>> for word in words(demo, 32):
...     print(word)
... 
lorem
ipsum
dolor
sit
amet
consectetur
adipiscing
elit
pellentesque
in
nulla
nec
mi
laoreet
tempus
non
id
nisl
aliquam
dictum
justo
ut
volutpat
cursus
proin
dictum
nunc
eu
dictum
pulvinar
vestibulum
elementum
urna
sapien
non
commodo
felis
faucibus
id
curabitur
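To check that no word is lost or cut in half wherever a chunk boundary falls, one possible sanity test is to compare the chunked output against a single split of the whole text for every buffer size. This is only a sketch for Python 3.3+; the sample text and the range of sizes are arbitrary choices made for this illustration.

```python
import re
from io import StringIO

def words(file, buffersize=2048):
    buffer = ''
    for chunk in iter(lambda: file.read(buffersize), ''):
        parts = re.split(r"\W+", buffer + chunk)
        buffer = parts.pop()  # partial word at end of chunk or empty
        yield from (w.lower() for w in parts if w)
    if buffer:
        yield buffer.lower()

text = "Lorem ipsum dolor sit amet, consectetur adipiscing elit."
# Reference result: split the whole (small) text in one go.
expected = [w.lower() for w in re.split(r"\W+", text) if w]

# Try every chunk size from one character up to more than the whole text.
for size in range(1, len(text) + 2):
    assert list(words(StringIO(text), size)) == expected

print("all chunk sizes agree")
```

Note that the buffer only shrinks when a non-word character is seen, so memory use is bounded by the longest word plus one chunk, not by the line length.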

Upvotes: 5
