Memory efficency of defaultdict

Question

I was trying some examples for calculating PMI, trying to calculate for some tweet messages i had (collection of ~50k), if found out that the bottleneck of the implementation of the algorithm was in the defaultdict(lambda : defaultdict(int)), and i don't know why:

Here is the example where i profiled it and was taking a lot of memory and time

for term, n in p_t.items():
    positive_assoc = sum(pmi[term][tx] for tx in positive_vocab)
    negative_assoc = sum(pmi[term][tx] for tx in negative_vocab)
    semantic_orientation[term] = positive_assoc - negative_assoc

where the part:

positive_assoc = sum(pmi[term][tx] for tx in positive_vocab)
negative_assoc = sum(pmi[term][tx] for tx in negative_vocab)

allocates a lot of memory for some reason. I am assuming that for the values that are doesn't exist returns 0, so the array passed to the sum function is pretty big.

I solved the problem with simple if value exist and a variable sum_pos.

The whole implementation from the blog:

pmi = defaultdict(lambda : defaultdict(int))
for t1 in p_t:
    for t2 in com[t1]:
        denom = p_t[t1] * p_t[t2]
        pmi[t1][t2] = math.log2(p_t_com[t1][t2] / denom)

semantic_orientation = {}
for term, n in p_t.items():
    positive_assoc = sum(pmi[term][tx] for tx in positive_vocab)
    negative_assoc = sum(pmi[term][tx] for tx in negative_vocab)
    semantic_orientation[term] = positive_assoc - negative_assoc

Martijn Pieters · Accepted Answer

defaultdict will call the factory function for each and every key that is missing. If you are using it in a sum() where there are a lot of keys missing you will indeed create a whole load of dictionaries that grow to contain more keys without you using them.

Switch to using the dict.get() method here to prevent the objects being created:

positive_assoc = sum(pmi.get(term, {}).get(tx, 0) for tx in positive_vocab)
negative_assoc = sum(pmi.get(term, {}).get(tx, 0) for tx in negative_vocab)

Note that the pmi.get() call returns an empty dictionary, so that the chained dict.get() call continues to work and can return the default 0 if there is no dictionary associated with the given term.

Memory efficency of defaultdict

Answers (2)

Related Questions