Reputation: 3523
I was trying some examples for calculating PMI, trying to calculate for some tweet messages i had (collection of ~50k), if found out that the bottleneck of the implementation of the algorithm was in the defaultdict(lambda : defaultdict(int))
, and i don't know why:
Here is the example where i profiled it and was taking a lot of memory and time
for term, n in p_t.items():
positive_assoc = sum(pmi[term][tx] for tx in positive_vocab)
negative_assoc = sum(pmi[term][tx] for tx in negative_vocab)
semantic_orientation[term] = positive_assoc - negative_assoc
where the part:
positive_assoc = sum(pmi[term][tx] for tx in positive_vocab)
negative_assoc = sum(pmi[term][tx] for tx in negative_vocab)
allocates a lot of memory for some reason. I am assuming that for the values that are doesn't exist returns 0, so the array passed to the sum function is pretty big.
I solved the problem with simple if value exist
and a variable sum_pos
.
The whole implementation from the blog:
pmi = defaultdict(lambda : defaultdict(int))
for t1 in p_t:
for t2 in com[t1]:
denom = p_t[t1] * p_t[t2]
pmi[t1][t2] = math.log2(p_t_com[t1][t2] / denom)
semantic_orientation = {}
for term, n in p_t.items():
positive_assoc = sum(pmi[term][tx] for tx in positive_vocab)
negative_assoc = sum(pmi[term][tx] for tx in negative_vocab)
semantic_orientation[term] = positive_assoc - negative_assoc
Upvotes: 0
Views: 349
Reputation: 920
I like Martjin's answer... but this should also work, and you might find it more readable.
positive_assoc = sum(pmi[term][tx] for tx in positive_vocab if term in pmi and
tx in pmi[term)
negative_assoc = sum(pmi[term][tx] for tx in negative_vocab if term in pmi and
tx in pmi[term)
Upvotes: 1
Reputation: 1122292
defaultdict
will call the factory function for each and every key that is missing. If you are using it in a sum()
where there are a lot of keys missing you will indeed create a whole load of dictionaries that grow to contain more keys without you using them.
Switch to using the dict.get()
method here to prevent the objects being created:
positive_assoc = sum(pmi.get(term, {}).get(tx, 0) for tx in positive_vocab)
negative_assoc = sum(pmi.get(term, {}).get(tx, 0) for tx in negative_vocab)
Note that the pmi.get()
call returns an empty dictionary, so that the chained dict.get()
call continues to work and can return the default 0
if there is no dictionary associated with the given term
.
Upvotes: 3