Reputation: 1775
I am trying to do some analytics against a large dictionary created by reading a file from disk. The read operation results in a stable memory footprint. I then have a method which performs some calculations based on data I copy out of that dictionary into a temporary dictionary. I do this so that all the copying and data use is scoped in the method, and would, I had hoped, disappear at the end of the method call.
Sadly, I am doing something wrong. The customerdict definition is as follows (defined at the top of the .py file, as a module-level variable):
customerdict = collections.defaultdict(dict)
The format of the object is {customerid: {id: 0 or 1}}, that is, each customer ID maps to a dictionary of IDs whose values are 0 or 1.
There is also a similarly defined dictionary called allids.
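For concreteness, here is a small sketch of that shape. The customer and item IDs are made up for illustration and are not from the real data. It also shows a defaultdict behavior that matters for memory use: looking up a missing key with square brackets inserts a new empty dict for that key, while .get() does not.

```python
import collections

# Hypothetical sample data; the real IDs come from the file on disk.
customerdict = collections.defaultdict(dict)
customerdict["cust_a"]["item_1"] = 1
customerdict["cust_a"]["item_2"] = 0
customerdict["cust_b"]["item_2"] = 1

# Plain-dict view of the nested structure: {customerid: {id: 0 or 1}}
shape = {cust: dict(items) for cust, items in customerdict.items()}
print(shape)

# Indexing a missing key on a defaultdict creates the entry as a side effect.
before = len(customerdict)
_ = customerdict["cust_missing"]
after = len(customerdict)

# .get() does not call the default factory, so nothing is inserted.
missing = customerdict.get("cust_other")
after_get = len(customerdict)

print(before, after, after_get)
```

So a lookup such as customerdict[custID1] with an unknown custID1 would itself add an entry to customerdict.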
I have a method, sim_pearson, for calculating the Pearson similarity score (modified from code in the book Programming Collective Intelligence), which is below.
def sim_pearson(custID1, custID2):
    si = []
    smallcustdict = {}
    smallcustdict[custID1] = customerdict[custID1]
    smallcustdict[custID2] = customerdict[custID2]
    #a loop to round out the remaining allids object to fill in 0 values
    for customerID, catalog in smallcustdict.iteritems():
        for id in allids:
            if id not in catalog:
                smallcustdict[customerID][id] = 0.0
    #get the list of mutually rated items
    for id in smallcustdict[custID1]:
        if id in smallcustdict[custID2]:
            si.append(id) # = 1
    #return 0 if there are no matches
    if len(si) == 0: return 0
    #add up all the preferences
    sum1 = sum([smallcustdict[custID1][id] for id in si])
    sum2 = sum([smallcustdict[custID2][id] for id in si])
    #sum up the squares
    sum1sq = sum([pow(smallcustdict[custID1][id],2) for id in si])
    sum2sq = sum([pow(smallcustdict[custID2][id],2) for id in si])
    #sum up the products
    psum = sum([smallcustdict[custID1][id] * smallcustdict[custID2][id] for id in si])
    #calc Pearson score
    num = psum - (sum1*sum2/len(si))
    den = sqrt((sum1sq - pow(sum1,2)/len(si)) * (sum2sq - pow(sum2,2)/len(si)))
    del smallcustdict
    del si
    del sum1
    del sum2
    del sum1sq
    del sum2sq
    del psum
    if den == 0:
        return 0
    return num/den
Every call to the sim_pearson method grows the memory footprint of python.exe without bound. I tried using the "del" statement to explicitly delete the locally scoped variables.
Looking at Task Manager, the memory jumps up in 6-10 MB increments. Once the initial customerdict is set up, the footprint is 137 MB.
Any ideas why I am running out of memory doing it this way?
Upvotes: 0
Views: 1188
Reputation: 123541
Try changing the following two lines:
smallcustdict[custID1] = customerdict[custID1]
smallcustdict[custID2] = customerdict[custID2]
to
smallcustdict[custID1] = customerdict[custID1].copy()
smallcustdict[custID2] = customerdict[custID2].copy()
That way, the changes you make to the two dictionaries do not persist in customerdict when the sim_pearson() function returns.
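The difference can be seen in a small sketch. The data and the helper name fill_missing are made up for illustration; the helper stands in for the zero-filling loop in the question. Note that .copy() is a shallow copy, which should be enough here because the inner values are plain numbers.

```python
# Hypothetical stand-ins for the question's module-level objects.
customerdict = {"cust_a": {"item_1": 1}, "cust_b": {"item_2": 1}}
allids = ["item_1", "item_2", "item_3"]

def fill_missing(catalog):
    """Add a 0.0 entry for every id in allids that catalog lacks."""
    for item_id in allids:
        if item_id not in catalog:
            catalog[item_id] = 0.0

# Plain assignment: both names refer to the same dict object,
# so the zeros end up stored inside customerdict.
alias = customerdict["cust_a"]
fill_missing(alias)
alias_size = len(customerdict["cust_a"])

# Shallow copy: the zeros land in the copy only, and the copy
# can be garbage collected once it goes out of scope.
copied = customerdict["cust_b"].copy()
fill_missing(copied)
copy_size = len(customerdict["cust_b"])

print(alias_size, copy_size)
```

With plain assignment the catalog stored in customerdict grows to one entry per id in allids; with the copy it keeps its original size.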
Upvotes: 1
Reputation: 89097
I presume the issue is here:
smallcustdict[custID1] = customerdict[custID1]
smallcustdict[custID2] = customerdict[custID2]
#a loop to round out the remaining allids object to fill in 0 values
for customerID, catalog in smallcustdict.iteritems():
    for id in allids:
        if id not in catalog:
            smallcustdict[customerID][id] = 0.0
The dictionaries from customerdict are being referenced in smallcustdict, not copied, so when you add to them, the additions persist in customerdict. This is the only point that I can see where you do anything that will persist out of scope, so I would imagine this is the problem.
Note that you are making a lot of work for yourself in many places by not using list comprehensions, by doing the same thing repeatedly, and by not writing generic ways to do things. A better version might be as follows:
import collections
import functools
import operator
from math import sqrt

customerdict = collections.defaultdict(dict)

def sim_pearson(custID1, custID2):
    # Declaring as a dict literal is nicer. Copy each catalog so the
    # zero-fill below does not persist in customerdict.
    smallcustdict = {
        custID1: customerdict[custID1].copy(),
        custID2: customerdict[custID2].copy(),
    }
    # Unchanged, as I'm not sure what the intent is here.
    for customerID, catalog in smallcustdict.iteritems():
        for id in allids:
            if id not in catalog:
                smallcustdict[customerID][id] = 0.0
    # Dict views are set-like, so the easier way to do what you want is the intersection of the two.
    si = smallcustdict[custID1].viewkeys() & smallcustdict[custID2].viewkeys()
    # "if not" is a cleaner way of checking for no values.
    if not si:
        return 0
    # Made more generic to avoid repetition and wastefully looping repeatedly.
    parts = [list(part) for part in zip(*((value[id] for value in smallcustdict.values()) for id in si))]
    sums = [sum(part) for part in parts]
    sumsqs = [sum(pow(i, 2) for i in part) for part in parts]
    psum = sum(functools.reduce(operator.mul, part) for part in zip(*parts))
    sum1, sum2 = sums
    sum1sq, sum2sq = sumsqs
    # Unchanged.
    num = psum - (sum1*sum2/len(si))
    den = sqrt((sum1sq - pow(sum1,2)/len(si)) * (sum2sq - pow(sum2,2)/len(si)))
    # Again using "if not".
    if not den:
        return 0
    else:
        return num/den
Note that this is entirely untested, as the code you gave isn't a complete example. However, it should be easy enough to use as a basis for improvement.
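If the zero-filling loop is only there so that an id missing from a catalog counts as 0.0, one alternative (a sketch, not what either answer proposes) is to read the catalogs with .get() and never build or modify a dictionary at all. The function name sim_pearson_no_fill and its parameters are hypothetical, and the sketch assumes every id in either catalog also appears in all_ids. It is written to run on Python 3, unlike the Python 2 code in the thread.

```python
from math import sqrt

def sim_pearson_no_fill(prefs1, prefs2, all_ids):
    """Pearson correlation over all_ids, treating missing ids as 0.0.

    The catalogs are only read via .get(), so neither dict is modified.
    """
    n = len(all_ids)
    if n == 0:
        return 0.0
    xs = [float(prefs1.get(item_id, 0.0)) for item_id in all_ids]
    ys = [float(prefs2.get(item_id, 0.0)) for item_id in all_ids]
    sum1, sum2 = sum(xs), sum(ys)
    sum1sq = sum(x * x for x in xs)
    sum2sq = sum(y * y for y in ys)
    psum = sum(x * y for x, y in zip(xs, ys))
    num = psum - sum1 * sum2 / n
    den_sq = (sum1sq - sum1 ** 2 / n) * (sum2sq - sum2 ** 2 / n)
    # A non-positive value means at least one side has no variance.
    if den_sq <= 0:
        return 0.0
    return num / sqrt(den_sq)

# Made-up catalogs and ids, for illustration only.
a = {"item_1": 1, "item_2": 0}
b = {"item_1": 1, "item_3": 1}
ids = ["item_1", "item_2", "item_3", "item_4"]
score = sim_pearson_no_fill(a, b, ids)
print(score)
```

Because nothing is written into the catalogs, repeated calls leave the size of the source dictionaries unchanged; the only temporary objects are the two lists, which are released when the function returns.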
Upvotes: 3