Python memory management with list comprehensions

Question

I am trying to do some analytics against a large dictionary created by reading a file from disk. The read operation results in a stable memory footprint. I then have a method which performs some calculations based on data I copy out of that dictionary into a temporary dictionary. I do this so that all the copying and data use is scoped in the method, and would, I had hoped, disappear at the end of the method call.

Sadly, I am doing something wrong. customerdict is defined at the top of the .py file as a module-level variable:

customerdict = collections.defaultdict(dict)

The format of the object is {customerid: {id: 0 or 1}}, i.e. each customer ID maps to an inner dictionary of item IDs with a value of 0 or 1.

There is also a similarly defined dictionary called allids.

I have a method for calculating the sim_pearson distance (modified code from Programming Collective Intelligence book), which is below.

from math import sqrt

def sim_pearson(custID1, custID2):
    si = []

    smallcustdict = {}
    smallcustdict[custID1] = customerdict[custID1]
    smallcustdict[custID2] = customerdict[custID2]

    # a loop to round out the remaining allids object to fill in 0 values
    for customerID, catalog in smallcustdict.iteritems():
        for id in allids:
            if id not in catalog:
                smallcustdict[customerID][id] = 0.0

    # get the list of mutually rated items
    for id in smallcustdict[custID1]:
        if id in smallcustdict[custID2]:
            si.append(id)

    # return 0 if there are no matches
    if len(si) == 0:
        return 0

    # add up all the preferences
    sum1 = sum([smallcustdict[custID1][id] for id in si])
    sum2 = sum([smallcustdict[custID2][id] for id in si])

    # sum up the squares
    sum1sq = sum([pow(smallcustdict[custID1][id], 2) for id in si])
    sum2sq = sum([pow(smallcustdict[custID2][id], 2) for id in si])

    # sum up the products
    psum = sum([smallcustdict[custID1][id] * smallcustdict[custID2][id] for id in si])

    # calc Pearson score
    num = psum - (sum1 * sum2 / len(si))
    den = sqrt((sum1sq - pow(sum1, 2) / len(si)) * (sum2sq - pow(sum2, 2) / len(si)))

    del smallcustdict
    del si
    del sum1
    del sum2
    del sum1sq
    del sum2sq
    del psum

    if den == 0:
        return 0

    return num / den

Every call to sim_pearson grows the memory footprint of python.exe without bound. I tried using "del" statements to explicitly delete the local variables.

Looking at Task Manager, the memory jumps up in 6-10 MB increments. Once the initial customerdict is set up, the footprint is 137 MB.

Any ideas why I am running out of memory doing it this way?

martineau · Accepted Answer

Try changing the following two lines:

smallcustdict[custID1] = customerdict[custID1]
smallcustdict[custID2] = customerdict[custID2]

to

smallcustdict[custID1] = customerdict[custID1].copy()
smallcustdict[custID2] = customerdict[custID2].copy()

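Plain assignment stores a reference to the same inner dictionary rather than a copy, so the loop that fills in 0.0 values is actually adding an entry to customerdict for every id in allids, and those entries remain after the call. With .copy(), the changes you make to the two dictionaries do not persist in customerdict when the sim_pearson() function returns.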
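
To see the difference in isolation, here is a minimal sketch. The sample data and the helper names fill_without_copy and fill_with_copy are hypothetical, made up for illustration and not taken from the question's code; only customerdict and allids follow the names used above.

```python
import collections

# Hypothetical sample data in the shape described in the question.
customerdict = collections.defaultdict(dict)
customerdict["c1"] = {"a": 1}
allids = ["a", "b", "c"]


def fill_without_copy(cust_id):
    # Stores a reference: small[cust_id] IS customerdict[cust_id].
    small = {cust_id: customerdict[cust_id]}
    for item_id in allids:
        if item_id not in small[cust_id]:
            small[cust_id][item_id] = 0.0  # also lands in customerdict
    return small


def fill_with_copy(cust_id):
    # Stores an independent shallow copy of the inner dictionary.
    small = {cust_id: customerdict[cust_id].copy()}
    for item_id in allids:
        if item_id not in small[cust_id]:
            small[cust_id][item_id] = 0.0  # only the copy grows
    return small


filled = fill_with_copy("c1")
size_after_copy = len(customerdict["c1"])   # original entry is untouched

fill_without_copy("c1")
size_after_alias = len(customerdict["c1"])  # original entry was padded with zeros
```

A shallow copy is enough here because the inner dictionaries hold only numbers. Note that the copy itself is freed when the function returns, so "del" is not needed for that.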
