Reputation: 11
I have a file that is 699 MB containing N-Grams. I have tried the following code to count the frequency of each N-Gram, however when I run it I get a memory error or the program crashes.
from collections import Counter
import sys

c = Counter()
with open('Output.txt') as myfile:
    for line in myfile:
        c.update(line.split())

print(c)

with open('NGramCountOutput.txt', 'w') as outgrams:
    outgrams.write(str(c))
Can anyone suggest a more elegant solution to this problem, or another approach?
Upvotes: 1
Views: 1052
Reputation: 155744
Psychic debugging: Your input file is actually a single line, containing all the ngrams. So when you do:
for line in myfile:
    c.update(line.split())
it's actually reading in the whole file as a single "line", then splitting it into a list of all the ngrams. The problem is that this means a huge memory cost to store individual copies of all the ngrams momentarily before they get dedup-ed in the Counter: a three-letter ASCII str in Python 3.5 x64 uses ~52 bytes, plus another 8 bytes for the reference to it in the resulting list. If you read in a line that's 699 MB of three-letter strings with a space between each of them, then split it, you'll produce ~183 million of these strings, which puts a rough lower bound on memory usage at 183000000 * 60, or around 10 GB. The costs would be lower on 32-bit machines, but by no more than 50% (and likely less); and on a 32-bit machine you don't have enough virtual address space to store even 5 GB (most 32-bit processes are limited to 2 GB).
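(If you want to sanity-check those per-object costs on your own interpreter, sys.getsizeof reports them directly; the exact figures vary a little between Python versions and builds.)

import sys

# Rough per-object costs on 64-bit CPython 3.5 (numbers vary slightly by version/build)
print(sys.getsizeof('abc'))                               # ~52 bytes for a three-letter ASCII str
print(sys.getsizeof(['abc'] * 1000) - sys.getsizeof([]))  # ~8000: 8 bytes per reference in the list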
The easiest fix would be to split your file up so each ngram is on its own line (or to limit the number of ngrams per line to something reasonable). For example, with tr (on UNIX-like machines), the conversion is trivial:

tr ' ' '\n' < Output.txt > OutputNewlines.txt
Similar approaches can be used in many text editors with find/replace.
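If tr isn't available (e.g. on Windows), a rough Python equivalent is sketched below; it streams the file in chunks so the whole thing never sits in memory (the 1 MB chunk size is arbitrary):

# Stream Output.txt, converting spaces to newlines, without loading the whole file
with open('Output.txt') as src, open('OutputNewlines.txt', 'w') as dst:
    for chunk in iter(lambda: src.read(1 << 20), ''):  # 1 MB at a time
        dst.write(chunk.replace(' ', '\n'))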
If that's not an option, you'll want to read explicitly block by block rather than line by line, processing everything before the last space, saving what remains, and then reading another block.
from collections import Counter
from functools import partial

c = Counter()
with open('Output.txt') as myfile:
    remainder = ''
    # Read the file in fixed-size blocks, not lines, assuming no single ngram
    # is longer than the 8192-character block size
    for block in iter(partial(myfile.read, 8192), ''):
        combined = remainder + block
        if combined[-1:].isspace():
            # The block ended on a separator, so every ngram in it is complete
            toprocess, remainder = combined, ''
        else:
            # Split off the last whitespace-separated piece (it might continue
            # into the next block); a block with no whitespace at all is most
            # likely the tail of the file, so just process what we've got and
            # leave remainder empty
            parts = combined.rsplit(None, 1)
            toprocess, remainder = (parts + [''])[:2]
        c.update(toprocess.split())
c.update(remainder.split())  # Add whatever was left over
That should limit maximum memory usage to be proportional to the number of unique ngrams, not the total number of (non-unique) ngrams on the line.
If you have relatively few unique ngrams, then that's enough. If you have a lot of unique ngrams though, stringifying the Counter could cost a lot of memory too (though the Counter itself would be using far more; the str would just be the straw that broke the camel's back). A simple way to print the counts one per line is:
from itertools import starmap

with open('NGramCountOutput.txt', 'w') as outgrams:
    # On Python 2, use .iteritems() instead of .items() to avoid a large temporary list.
    # If a temporary list is okay, and you want the output sorted by count,
    # use .most_common() instead of .items()
    outgrams.writelines(starmap('{} {}\n'.format, c.items()))
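For example, if a temporary list is acceptable and you'd rather have the output ordered from most to least common, the same pattern with most_common() would be:

# Same output format, but sorted from highest count to lowest
outgrams.writelines(starmap('{} {}\n'.format, c.most_common()))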
Upvotes: 1
Reputation: 1975
Try iterating over c instead of stringifying it in memory:
for k in c:
    outgrams.write("{0}: {1}\n".format(k, c[k]))
Upvotes: 3