bagel123

Reputation: 11

Python counting ngram frequency in large files

I have a 699 MB file containing N-grams. I tried the following code to count the frequency of each N-gram, but when I run it I get a memory error or the program crashes.

from collections import Counter
import sys

c = Counter()
with open('Output.txt') as myfile:
    for line in myfile:
        c.update(line.split())

print(c)

with open('NGramCountOutput.txt', 'w') as outgrams:
    outgrams.write(str(c))

Can anyone suggest a more elegant solution to this problem, or another approach?

Upvotes: 1

Views: 1052

Answers (2)

ShadowRanger

Reputation: 155744

Psychic debugging: Your input file is actually a single line, containing all the ngrams. So when you do:

for line in myfile:
    c.update(line.split())

it's actually reading in the whole file as a single "line", then splitting that line into a list of all the ngrams. The problem is that this briefly stores individual copies of every ngram before they get deduplicated in the Counter. A three-letter ASCII str on Python 3.5 x64 uses ~52 bytes, plus another 8 bytes for the reference to it in the resulting list. If you read in a line that's 699 MB of three-letter strings separated by single spaces and then split it, you'll produce ~183 million of these strings, so a rough lower bound on memory usage is 183,000,000 * 60 bytes, or around 10 GB. The costs would be lower on 32-bit machines, but not by more than 50% (and likely less), and on a 32-bit machine you don't have enough virtual address space to store even 5 GB (most 32-bit processes are limited to 2 GB).
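If you want to sanity-check those numbers on your own interpreter, a quick sketch using sys.getsizeof looks something like this (this is an illustration, not part of the fix; exact sizes vary slightly between CPython versions and pointer widths):

import sys

# Per-object costs on 64-bit CPython 3.5 (approximate)
print(sys.getsizeof('abc'))                        # ~52 bytes per three-letter ASCII str
print(sys.getsizeof([None]) - sys.getsizeof([]))   # ~8 bytes per list slot (one pointer)

# Back-of-the-envelope total for a single 699 MB line of "abc "-style tokens:
ngrams = 699 * 1024 * 1024 // 4        # ~183 million four-byte "ngram + space" chunks
print(ngrams * (52 + 8) / 1024 ** 3)   # roughly 10 GB before deduplication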

The easiest fix would be to split your file up to put each ngram on its own line (or limit the number of ngrams per line to a reasonable number). For example, with tr (on UNIX-like machines), the conversion is trivial:

tr ' ' '\n' < Output.txt > OutputNewlines.txt

Similar approaches can be used in many text editors with find/replace.
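If tr isn't available (e.g. on Windows), a rough streaming equivalent in Python could look like the sketch below; it assumes the same Output.txt / OutputNewlines.txt names used above and only ever holds one chunk in memory:

# Streaming space-to-newline conversion: replacing a single-character separator
# works the same no matter where the chunk boundaries fall.
with open('Output.txt') as src, open('OutputNewlines.txt', 'w') as dst:
    while True:
        chunk = src.read(1 << 20)   # 1 MB at a time
        if not chunk:
            break
        dst.write(chunk.replace(' ', '\n'))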

If that's not an option, you'll want to read the file in fixed-size blocks rather than line by line, processing everything before the last space in each block, saving whatever remains, and prepending it to the next block.

from collections import Counter
from functools import partial

c = Counter()
with open('Output.txt') as myfile:
    remainder = ''
    # Read file by fixed size blocks, not lines, assuming no ngram is larger than 8192
    for block in iter(partial(myfile.read, 8192), ''):
        # Split off the last whitespace separated piece (might span to next block)
        parts = (remainder + block).rsplit(None, 1)
        # Handle block with and without whitespace identically; no whitespace means
        # probably handling EOF, just process what we've got and set remainder empty
        toprocess, remainder = (parts + [''])[:2]
        c.update(toprocess.split())
    c.update(remainder.split())  # Add whatever was left over

That should keep maximum memory usage proportional to the number of unique ngrams, not the total number of non-unique ngrams on the line.

If you have relatively few unique ngrams, then that's enough. If you have a lot of unique ngrams though, stringifying the Counter could cost a lot of memory too (though the Counter itself would be using far more, the str would just be the straw that broke the camel's back). A simple way to print the counts one per line is:

from itertools import starmap

with open('NGramCountOutput.txt', 'w') as outgrams:
    # On Python 2, use .iteritems() instead of .items() to avoid large temp list
    # If a temp list is okay, and you want sorted output by count,
    # use .most_common() over .items()
    outgrams.writelines(starmap('{} {}\n'.format, c.items()))
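If you'd rather have the output sorted by count (most frequent first) and a temporary list is acceptable, the .most_common() variant mentioned in the comment would be something like:

from itertools import starmap

# c is the Counter built earlier; most_common() returns (ngram, count) pairs
# sorted by descending count, at the cost of materializing a list.
with open('NGramCountOutput.txt', 'w') as outgrams:
    outgrams.writelines(starmap('{} {}\n'.format, c.most_common()))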

Upvotes: 1

Norman

Reputation: 1975

Try iterating over c instead of stringifying it in-memory:

for k in c:
    outgrams.write("{0}: {1}\n".format(k, c[k]))  # newline so each count is on its own line

Upvotes: 3
