Iterating through Python dictionary very slow

Question

I'm trying to compute the distance between pairs of users based on values of items that are assigned to them. However the distance calculation should be null when the two users do not have any intersecting items. I'm also calculating the lower half of the distance matrix only (eg. UserA-UserB is equivalent to UserB-UserA so only calculate one).

So I have this following Python script that works, but it really starts chugging when I feed it more than a few hundred users. The sample script below shows the input structure, but I'm trying to do thousands, not just the four I have shown here.

The line s = {k:v for k,v in data.items() if k in (user1,user2)} seems to add the most overhead

import math
from decimal import *

def has_matching_product(data,user1,user2):
    c1=set(data[user1].keys())
    c2=[k for k in data[user2].keys()]
    return any([x in c1 for x in c2])

def get_euclidean_dist(data,user1,user2):
    #Tried subsetting to run quicker?
    s = {k:v for k,v in data.items() if k in (user1,user2)}

    #Ignore users with no overlapping items
    if has_matching_product(s,user1,user2):
        items=set()
        for k,v in s.items():
            for ki in v.keys():
                items.add(ki)

        rs=Decimal(0)
        for i in items:
            p1 = s.get(user1).get(i)
            p2 = s.get(user2).get(i)
            v1 = p1 or 0
            v2 = p2 or 0

            rs+= Decimal((v1-v2)**2)
        return math.sqrt(rs)
    else:
        return None

#User/Product/Value
raw_data = {
    'U1': {
        'I1':5,
        'I4':2
    },
    'U2': {
        'I1':1,
        'I3':6
    },
    'U3': {
        'I3':11
    },
    'U4': {
        'I4':9
    }
}

users = sorted(raw_data.keys())
l = len(users)

data_out = set()
#Compute lower half of a distance matrix (unique pairs only)
for u1 in range(0,l-1):
    for u2 in range(1+u1,l):
        dist = get_euclidean_dist(raw_data,users[u1],users[u2])
        print('{x} | {y} | {d}'.format(x=users[u1],y=users[u2],d=dist)) #Sample output

What the proper output should look like:

U1 | U2 | 7.483314773547883
U1 | U3 | None
U1 | U4 | 8.602325267042627
U2 | U3 | 5.0990195135927845
U2 | U4 | None
U3 | U4 | None

dwanderson · Accepted Answer

The issue is that you're walking the ENTIRE dictionary every time, just to find the two items you want. And from the looks of it, you're pulling out the users, and then spending all that time trying to go find them again in data. @Peter Wood's suggestion will help a bunch - only grab the two users you want in the first place, but that's sort of missing the forest from the trees - you don't need to slim down your dictionary in the first place at all. Keep it all together:

import itertools
for kv1, kv2 in itertools.combinations(data.items(), 2):
    ## calculate distance directly here

Iterating through Python dictionary very slow

Answers (2)

Related Questions