Reputation: 6031
I'm trying to count the occurrences of strings in text files. The text files look like this, and each file is about 200MB.
String1 30
String2 100
String3 23
String1 5
.....
I want to save the counts into a dict.
count = {}
for filename in os.listdir(path):
    if(filename.endswith("idx")):
        continue
    print filename
    f = open(os.path.join(path, filename))
    for line in f:
        (s, cnt) = line[:-1].split("\t")
        if(s not in count):
            try:
                count[s] = 0
            except MemoryError:
                print(len(count))
                exit()
        count[s] += int(cnt)
    f.close()
print(len(count))
I got a MemoryError at count[s] = 0, but my machine still has plenty of free memory.
How do I resolve this problem?
Thank you!
UPDATE:
I copied the actual code here.
My Python version is 2.4.3, and the machine runs Linux with about 48G of memory, but the process consumes less than 5G. The code stops at len(count) = 44739243.
UPDATE2: The strings can be duplicated (they are not unique), so I want to add up all the counts for each string. The only operation I need afterwards is reading the count for each string. There are about 10M lines per file, and I have more than 30 files. I expect the total count to be less than 100 billion.
UPDATE3: The OS is Linux 2.6.18.
Upvotes: 3
Views: 2873
Reputation: 44436
If all you are trying to do is count the number of unique strings, you could hugely reduce your memory footprint by hashing each string:
(s, cnt) = line[:-1].split("\t")
s = hash(s)
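Put together, the idea is a minimal sketch like this (written for a modern Python, with made-up sample lines standing in for the file):

```python
# Sketch: keep hash(s) as the dict key instead of the string itself,
# so the (possibly long) strings are never retained in memory.
# Caveat: a hash collision would silently merge two different strings.
count = {}
sample_lines = ["String1\t30\n", "String2\t100\n", "String1\t5\n"]  # stand-in for file lines
for line in sample_lines:
    s, cnt = line.rstrip("\n").split("\t")
    h = hash(s)                            # fixed-size int instead of the full string
    count[h] = count.get(h, 0) + int(cnt)

print(len(count))              # 2
print(count[hash("String1")])  # 35
```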
Upvotes: 1
Reputation: 288298
cPython 2.4 can have problems with large memory allocations, even on x64:
$ python2.4 -c "'a' * (2**31-1)"
Traceback (most recent call last):
File "<string>", line 1, in ?
MemoryError
$ python2.5 -c "'a' * (2**31-1)"
$
Update to a recent python interpreter (like cPython 2.7) to get around these issues, and make sure to install a 64-bit version of the interpreter.
If the strings are of nontrivial size (i.e. longer than the <10 bytes in your example), you may also want to simply store their hashes instead, or even use a probabilistic (but way more efficient) storage like a bloom filter. To store their hashes, replace the file handling loop with
import hashlib

# ...
for line in f:
    s, cnt = line[:-1].split("\t")
    idx = hashlib.md5(s).digest()
    count[idx] = count.get(idx, 0) + int(cnt)
# ...
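Note that a Bloom filter itself only answers membership queries, not counts. For approximate per-string totals in fixed memory, the closely related count-min sketch is a better fit; a rough sketch of one (written for a modern Python, and the sizes here are made up, not tuned for your data):

```python
import hashlib

class CountMinSketch(object):
    """Fixed-memory approximate counter; queries may overestimate, never underestimate."""
    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _buckets(self, key):
        # one salted md5-derived bucket per row
        for row in range(self.depth):
            digest = hashlib.md5(("%d:%s" % (row, key)).encode("utf-8")).hexdigest()
            yield row, int(digest, 16) % self.width

    def add(self, key, amount=1):
        for row, idx in self._buckets(key):
            self.table[row][idx] += amount

    def query(self, key):
        # the true count is <= the minimum over all rows
        return min(self.table[row][idx] for row, idx in self._buckets(key))

cms = CountMinSketch()
cms.add("String1", 30)
cms.add("String2", 100)
cms.add("String1", 5)
print(cms.query("String1"))  # >= 35 (equal unless "String2" collides in every row)
```

Memory stays at width * depth counters no matter how many distinct strings go in; the price is that collisions can inflate a count, never shrink it.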
Upvotes: 4
Reputation: 7756
I'm not really sure why this crash happens. What is the estimated average length of your strings? With 44 million strings, if they are somewhat lengthy, you should maybe consider hashing them, as already suggested. The downside is that you lose the option to list your unique keys; you can only check whether a string is in your data or not.
Concerning the memory limit already being hit at 5 GB: maybe it's related to your outdated Python version. If you have the option to update, get 2.7. Same syntax (plus some extras), no issues. I don't even know whether the following code is still compatible with 2.4; you may have to take the with statement out again. At least, this is how you would write it in 2.7.
The main difference to your version is that it runs garbage collection by hand. Additionally, you can raise the memory limit that Python uses. As you mentioned, it only uses a small fraction of the actual RAM, so in case some strange default setting prohibits it from growing larger, try this:
MEMORY_MB_MAX = 30000
import gc
import os
import resource
from collections import defaultdict

resource.setrlimit(resource.RLIMIT_AS, (MEMORY_MB_MAX * 1048576L, -1L))

count = defaultdict(int)
for filename in os.listdir(path):
    if(filename.endswith("idx")):
        continue
    print filename
    with open(os.path.join(path, filename)) as f:
        for line in f:
            s, cnt = line[:-1].split("\t")
            count[s] += int(cnt)
    print(len(count))
    gc.collect()
Besides that, I don't get the meaning of your line s, cnt = line[:-1].split("\t"), especially the [:-1]. If the files look like you noted, this could erase the last digit of your numbers. Is that on purpose?
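For reference, line[:-1] chops the last character, which is the newline on every line except possibly the final one; a small sketch of the difference versus rstrip("\n"):

```python
# line[:-1] assumes every line ends in "\n"; the last line of a file often doesn't
with_newline = "String1\t30\n"
without_newline = "String1\t30"      # e.g. a final line with no trailing newline

print(repr(with_newline[:-1]))             # 'String1\t30' - fine
print(repr(without_newline[:-1]))          # 'String1\t3'  - last digit lost
print(repr(without_newline.rstrip("\n")))  # 'String1\t30' - safe in both cases
```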
Upvotes: 1