Reputation: 771
I am writing a very simple script that counts the number of occurrences in a file. The file is about 300 MB (15 million lines) and has 3 columns. Since I am reading the file line by line I don't expect Python to use much memory; at most slightly above 300 MB to store the counts dictionary.
However, when I look at Activity Monitor the memory usage goes above 1.5 GB. What am I doing wrong? If this is normal, could someone explain please? Thanks
import csv

def get_counts(filepath):
    with open(filepath, 'rb') as csvfile:
        reader = csv.DictReader(csvfile, fieldnames=['col1', 'col2', 'col3'], delimiter=',')
        counts = {}
        for row in reader:
            key1 = int(row['col1'])
            key2 = int(row['col2'])
            if (key1, key2) in counts:
                counts[key1, key2] += 1
            else:
                counts[key1, key2] = 1
        return counts
Upvotes: 4
Views: 514
Reputation: 59604
I think it's quite normal that Python uses so much memory in your case. Here is a test on my machine:
Python 2.7.10 (default, Oct 23 2015, 19:19:21)
[GCC 4.2.1 Compatible Apple LLVM 7.0.0 (clang-700.0.59.5)] on darwin
>>> file_size = 300000000
>>> column_count = 10
>>> average_string_size = 10
>>> row_count = file_size / (column_count * average_string_size)
>>> row_count
3000000
>>> import os, psutil, cPickle
>>> mem1 = psutil.Process(os.getpid()).memory_info().rss
>>> data = [{column_no: '*' * average_string_size for column_no in xrange(column_count)} for row_no in xrange(row_count)]
>>> mem2 = psutil.Process(os.getpid()).memory_info().rss
>>> mem2 - mem1
4604071936L
>>>
So a full list of 3000000 dicts, each with 10 items whose values are strings of length 10, uses more than 4 GB of RAM.
In your case I don't think the csv data takes the RAM: it's your counts dictionary.
Another explanation would be that the dicts read from the csv file one by one are not immediately garbage collected (though I can't confirm that).
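A rough back-of-the-envelope check with sys.getsizeof hints at why a tuple-keyed dict with millions of entries gets big (a sketch; the exact numbers vary by Python version and platform, and the dict's own hash table adds more on top):

```python
import sys

# Shallow size of one (int, int) key plus its int value, as in the counts dict.
# These values are illustrative; on 64-bit CPython a 2-tuple alone is ~56 bytes.
key = (12345, 67890)
value = 1
entry_bytes = sys.getsizeof(key) + sys.getsizeof(value)
print(entry_bytes)  # dozens of bytes per entry, before dict overhead
```

Multiplied by millions of distinct keys, plus the hash table itself, this easily reaches hundreds of megabytes.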
In any case, use a specialized tool to see what is taking the memory, for example https://pypi.python.org/pypi/memory_profiler
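If installing memory_profiler is not an option, the standard library's tracemalloc module gives a similar picture. A minimal sketch (the 100000-entry dict here just stands in for the real counts dict):

```python
import tracemalloc

tracemalloc.start()

# Build a stand-in counts dict with tuple keys, like the one in the question.
counts = {}
for i in range(100000):
    counts[(i, i + 1)] = 1

current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
print(current)  # bytes currently attributable to the dict, its keys and values
```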
P.S. Instead of doing
if (key1, key2) in counts:
    counts[key1, key2] += 1
else:
    counts[key1, key2] = 1
Do
from collections import defaultdict
...
counts = defaultdict(int)
...
counts[(key1, key2)] += 1
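Here is the defaultdict version as a self-contained sketch; the sample rows are made up to stand in for the parsed CSV data:

```python
from collections import defaultdict

# Hypothetical rows standing in for the parsed CSV data.
rows = [('1', '2', 'a'), ('1', '2', 'b'), ('3', '4', 'c')]

# Missing keys default to int() == 0, so no membership test is needed.
counts = defaultdict(int)
for col1, col2, _ in rows:
    key1, key2 = int(col1), int(col2)
    counts[(key1, key2)] += 1

print(dict(counts))  # {(1, 2): 2, (3, 4): 1}
```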
Upvotes: 2
Reputation: 1896
Try this:

from collections import Counter
import csv

with open(filename, 'r') as csvfile:
    myreader = csv.reader(csvfile)
    counts = Counter(tuple(row[:-1]) for row in myreader)

The keys need to be tuples: row[:-1] is a list, which is not hashable.
Hope this helps.
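For reference, the Counter approach on hypothetical in-memory rows (no file needed to see how it counts the first two columns):

```python
from collections import Counter

# Hypothetical rows: the last column is dropped, the rest form the key.
rows = [['1', '2', 'x'], ['1', '2', 'y'], ['3', '4', 'z']]

counts = Counter(tuple(row[:-1]) for row in rows)
print(counts[('1', '2')])  # 2
```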
Upvotes: 0
Reputation: 4523
You could try something like this:

import csv

def get_counts(filepath):
    data = csv.reader(open(filepath), delimiter=',')
    # Skip the first line if it contains headers
    fields = next(data)
    counts = {}
    for row in data:
        counts[row[0], row[1]] = counts.get((row[0], row[1]), 0) + 1
    return counts
Upvotes: 0