Romain

Reputation: 771

Python excessive memory usage with simple script

I am writing a very simple script that counts the number of occurrences in a file. The file is about 300 MB (15 million lines) and has 3 columns. Since I am reading the file line by line, I don't expect Python to use much memory. At most it would be slightly above 300 MB to store the count dictionary.

However, when I look at Activity Monitor, the memory usage goes above 1.5 GB. What am I doing wrong? If this is normal, could someone explain? Thanks.

import csv

def get_counts(filepath):
    with open(filepath, 'rb') as csvfile:  # binary mode for the Python 2 csv module
        reader = csv.DictReader(csvfile, fieldnames=['col1', 'col2', 'col3'], delimiter=',')
        counts = {}
        for row in reader:
            key1 = int(row['col1'])
            key2 = int(row['col2'])
            # count occurrences of each (col1, col2) pair
            if (key1, key2) in counts:
                counts[key1, key2] += 1
            else:
                counts[key1, key2] = 1
    return counts

Upvotes: 4

Views: 514

Answers (3)

warvariuc

Reputation: 59604

I think it's quite normal that Python uses this much memory in your case. Here is a test on my machine:

Python 2.7.10 (default, Oct 23 2015, 19:19:21)
[GCC 4.2.1 Compatible Apple LLVM 7.0.0 (clang-700.0.59.5)] on darwin
>>> file_size = 300000000
>>> column_count = 10
>>> average_string_size = 10
>>> row_count = file_size / (column_count * average_string_size)
>>> row_count
3000000
>>> import os, psutil, cPickle
>>> mem1 = psutil.Process(os.getpid()).memory_info().rss
>>> data = [{column_no: '*' * average_string_size for column_no in xrange(column_count)} for row_no in xrange(row_count)]
>>> mem2 = psutil.Process(os.getpid()).memory_info().rss
>>> mem2 - mem1
4604071936L
>>>

So a full list of 3,000,000 dicts, each with 10 items holding strings of length 10, uses more than 4 GB of RAM.

In your case I don't think it's the CSV data that takes the RAM; it's your counts dictionary.
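
A similar quick measurement for a counts-style dict supports that (a sketch with one million synthetic (int, int) keys; the asker's dict could hold up to 15 million entries, so scale accordingly):

>>> import os, psutil
>>> mem1 = psutil.Process(os.getpid()).memory_info().rss
>>> counts = {(i, i + 1): 1 for i in xrange(1000000)}
>>> mem2 = psutil.Process(os.getpid()).memory_info().rss
>>> mem2 - mem1  # on the order of a couple hundred MB for a million tuple keys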

Another explanation would be that the dicts read from the CSV file one by one are not immediately garbage collected (though I can't confirm that).

In any case, use a specialized tool to see what is taking the memory, for example https://pypi.python.org/pypi/memory_profiler
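
For instance, decorating the function with memory_profiler's profile and running the script prints a line-by-line memory report (a minimal sketch, assuming memory_profiler is installed via pip and 'data.csv' is a placeholder path):

import csv
from memory_profiler import profile  # pip install memory_profiler

@profile  # prints a line-by-line memory report when the function runs
def get_counts(filepath):
    counts = {}
    with open(filepath, 'rb') as csvfile:
        reader = csv.DictReader(csvfile, fieldnames=['col1', 'col2', 'col3'])
        for row in reader:
            key = (int(row['col1']), int(row['col2']))
            counts[key] = counts.get(key, 0) + 1
    return counts

get_counts('data.csv')  # placeholder path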

P.S. Instead of doing

        if (key1, key2) in counts:
            counts[key1, key2] += 1
        else:
            counts[key1, key2] = 1

Do

from collections import defaultdict
...
counts = defaultdict(int)
...
counts[(key1, key2)] += 1
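
Applied to the original function, that would look something like this (a sketch, keeping the question's Python 2 'rb' file mode):

import csv
from collections import defaultdict

def get_counts(filepath):
    counts = defaultdict(int)  # missing keys start at 0
    with open(filepath, 'rb') as csvfile:
        reader = csv.DictReader(csvfile, fieldnames=['col1', 'col2', 'col3'], delimiter=',')
        for row in reader:
            counts[int(row['col1']), int(row['col2'])] += 1
    return counts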

Upvotes: 2

sam

Reputation: 1896

Try this:

from collections import Counter
import csv

myreader = csv.reader(open(filename, 'r'))  # filename: path to the CSV file
# count tuples of all but the last column; tuples are hashable, lists are not
counts = Counter(tuple(row[:-1]) for row in myreader)
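
Note that the rows are converted to tuples because Counter items must be hashable. The result is a dict subclass, so for example most_common (a standard Counter method) returns the most frequent keys:

from collections import Counter

counts = Counter([('1', '2'), ('1', '2'), ('3', '4')])
print(counts.most_common(1))  # [(('1', '2'), 2)]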

Hope this helps.

Upvotes: 0

Till

Reputation: 4523

You could try something like this:

import csv

def get_counts(filepath):
    with open(filepath) as csvfile:
        data = csv.reader(csvfile, delimiter=',')
        next(data)  # remove the first line if it holds headers
        counts = {}
        for row in data:
            # dict.get avoids the separate "key in dict" test
            counts[row[0], row[1]] = counts.get((row[0], row[1]), 0) + 1
    return counts
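
For reference, calling it is then just (with 'data.csv' as a placeholder path):

counts = get_counts('data.csv')
print(len(counts))  # number of distinct (col1, col2) pairs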

Upvotes: 0
