user987654

Reputation: 6031

python memory error (there is enough available memory)

I'm trying to count the occurrences of strings in text files. The text files look like this, and each file is about 200MB.

String1 30
String2 100
String3 23
String1 5
.....

I want to save the counts into a dict.

import os
import sys

count = {}
for filename in os.listdir(path):
    if filename.endswith("idx"):
        continue
    print filename
    f = open(os.path.join(path, filename))
    for line in f:
        # strip the trailing newline, then split on the tab
        (s, cnt) = line[:-1].split("\t")
        if s not in count:
            try:
                count[s] = 0
            except MemoryError:
                # show how many keys were stored before the failure
                print(len(count))
                sys.exit()
        count[s] += int(cnt)
    f.close()
    print(len(count))
    print(len(count))

I get a MemoryError at count[s] = 0, but I still have plenty of available memory on my machine.
How do I resolve this problem? Thank you!

UPDATE: I copied the actual code here. My Python version is 2.4.3, and the machine is running Linux with about 48 GB of memory, but the process consumes less than 5 GB. The code stops at len(count)=44739243.

UPDATE2: The strings can be duplicated (they are not unique), so I want to add up all the counts for each string. The only operation I need afterwards is reading the count for a given string. There are about 10M lines per file, and I have more than 30 files. I expect the total count to be less than 100 billion.

UPDATE3: The OS is Linux 2.6.18.

Upvotes: 3

Views: 2873

Answers (3)

Michael Lorton

Reputation: 44436

If all you are trying to do is count the number of unique strings, you could hugely reduce your memory footprint by hashing each string:

    (s, cnt) = line[:-1].split("\t")
    s = hash(s)
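
To flesh that out, here is a minimal sketch of the full counting loop with hashed keys (an illustration only, not the answer's exact code; note that hash() can collide, so two distinct strings may occasionally share a counter):

    count = {}
    for line in f:
        (s, cnt) = line[:-1].split("\t")
        s = hash(s)    # store the fixed-size integer hash instead of the string
        count[s] = count.get(s, 0) + int(cnt)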

Upvotes: 1

phihag

Reputation: 288298

CPython 2.4 can have problems with large memory allocations, even on x64:

$ python2.4 -c "'a' * (2**31-1)"
Traceback (most recent call last):
  File "<string>", line 1, in ?
MemoryError
$ python2.5 -c "'a' * (2**31-1)"
$

Update to a recent Python interpreter (like CPython 2.7) to get around these issues, and make sure to install a 64-bit version of the interpreter.
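
One quick way to tell whether the interpreter is a 64-bit build (sys.maxint is 2**63-1 on 64-bit Python 2 builds and 2**31-1 on 32-bit ones):

$ python -c "import sys; print sys.maxint"
9223372036854775807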

If the strings are of nontrivial size (i.e. longer than the <10 bytes in your example), you may also want to simply store their hashes instead, or even use a probabilistic (but far more memory-efficient) structure like a Bloom filter (see the sketch after the code below). To store their hashes, replace the file-handling loop with

import hashlib
# ...
for line in f:
    s, cnt = line[:-1].split("\t")
    idx = hashlib.md5(s).digest()    # fixed 16-byte digest instead of the full string
    count[idx] = count.get(idx, 0) + int(cnt)
# ...
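
For completeness, a toy sketch of the Bloom filter idea mentioned above (an illustration only, with arbitrary sizing; it answers membership queries, not counts, and needs Python 2.6+ for bytearray):

import hashlib

class BloomFilter(object):
    # toy Bloom filter: answers "possibly seen" / "definitely not seen"
    def __init__(self, m, k):
        self.m = m                          # number of bits in the filter
        self.k = k                          # probes per string (k <= 4 here)
        self.bits = bytearray(m // 8 + 1)   # bit array, initially all zero

    def _positions(self, s):
        # slice md5's 32 hex chars into k 8-char (32-bit) probe values
        digest = hashlib.md5(s).hexdigest()
        return [int(digest[8 * i:8 * i + 8], 16) % self.m for i in range(self.k)]

    def add(self, s):
        for p in self._positions(s):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, s):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(s))

Usage would be bf = BloomFilter(8 * 10**8, 4); bf.add(s); s in bf. False positives are possible, false negatives are not, and it cannot store the counts themselves.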

Upvotes: 4

Michael

Reputation: 7756

I'm not really sure why this crash happens. What is the estimated average size of your strings? With 44 million strings, if they are somewhat lengthy, you should maybe consider hashing them, as already suggested. The downside is that you lose the option to list your unique keys; you can only check whether a string is in your data or not.

Concerning the memory limit already being hit at 5 GB, maybe it's related to your outdated Python version. If you have the option to update, get 2.7. Same syntax (plus some extras), no issues. Well, I don't even know if the following code is still compatible with 2.4; maybe you would have to take the with statement out again, but at least this is how you would write it in 2.7.

The main difference from your version is running garbage collection by hand. Additionally, you can raise the memory limit that Python uses. As you mentioned, it only uses a small fraction of the actual RAM, so in case some strange default setting prohibits it from growing larger, try this:

MEMORY_MB_MAX = 30000
import gc
import os
import resource
from collections import defaultdict

# raise the address-space limit to MEMORY_MB_MAX megabytes
resource.setrlimit(resource.RLIMIT_AS, (MEMORY_MB_MAX * 1048576L, -1L))

count = defaultdict(int)
for filename in os.listdir(path):
    if filename.endswith("idx"):
        continue
    print filename
    with open(os.path.join(path, filename)) as f:
        for line in f:
            s, cnt = line[:-1].split("\t")
            count[s] += int(cnt)
    print(len(count))
    gc.collect()    # force a collection after each file

Besides that, I don't get the meaning of your line s, cnt = line[:-1].split("\t"), especially the [:-1]. It strips the trailing newline on most lines, but if the last line of a file has no newline, it would erase the last digit of that line's number; line.rstrip("\n") would be safer. Is this on purpose?
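
A quick illustration (assuming the tab-separated format from the question):

>>> "String1\t30\n"[:-1].split("\t")
['String1', '30']
>>> "String1\t30"[:-1].split("\t")    # final line without a trailing newline
['String1', '3']
>>> "String1\t30".rstrip("\n").split("\t")
['String1', '30']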

Upvotes: 1
