Reputation: 1793
I'm trying to compute the distance between pairs of users based on values of items that are assigned to them. However the distance calculation should be null when the two users do not have any intersecting items. I'm also calculating the lower half of the distance matrix only (eg. UserA-UserB is equivalent to UserB-UserA so only calculate one).
So I have this following Python script that works, but it really starts chugging when I feed it more than a few hundred users. The sample script below shows the input structure, but I'm trying to do thousands, not just the four I have shown here.
The line s = {k:v for k,v in data.items() if k in (user1,user2)}
seems to add the most overhead
import math
from decimal import *
def has_matching_product(data,user1,user2):
c1=set(data[user1].keys())
c2=[k for k in data[user2].keys()]
return any([x in c1 for x in c2])
def get_euclidean_dist(data,user1,user2):
#Tried subsetting to run quicker?
s = {k:v for k,v in data.items() if k in (user1,user2)}
#Ignore users with no overlapping items
if has_matching_product(s,user1,user2):
items=set()
for k,v in s.items():
for ki in v.keys():
items.add(ki)
rs=Decimal(0)
for i in items:
p1 = s.get(user1).get(i)
p2 = s.get(user2).get(i)
v1 = p1 or 0
v2 = p2 or 0
rs+= Decimal((v1-v2)**2)
return math.sqrt(rs)
else:
return None
#User/Product/Value
raw_data = {
'U1': {
'I1':5,
'I4':2
},
'U2': {
'I1':1,
'I3':6
},
'U3': {
'I3':11
},
'U4': {
'I4':9
}
}
users = sorted(raw_data.keys())
l = len(users)
data_out = set()
#Compute lower half of a distance matrix (unique pairs only)
for u1 in range(0,l-1):
for u2 in range(1+u1,l):
dist = get_euclidean_dist(raw_data,users[u1],users[u2])
print('{x} | {y} | {d}'.format(x=users[u1],y=users[u2],d=dist)) #Sample output
What the proper output should look like:
U1 | U2 | 7.483314773547883
U1 | U3 | None
U1 | U4 | 8.602325267042627
U2 | U3 | 5.0990195135927845
U2 | U4 | None
U3 | U4 | None
Upvotes: 1
Views: 557
Reputation: 2832
The issue is that you're walking the ENTIRE dictionary every time, just to find the two items you want. And from the looks of it, you're pulling out the user
s, and then spending all that time trying to go find them again in data
. @Peter Wood's suggestion will help a bunch - only grab the two user
s you want in the first place, but that's sort of missing the forest from the trees - you don't need to slim down your dictionary in the first place at all. Keep it all together:
import itertools
for kv1, kv2 in itertools.combinations(data.items(), 2):
## calculate distance directly here
Upvotes: 3
Reputation: 42748
You are using decimal, which is not very fast. Dictionaries are already a set of keys, so creating extra sets is not neccessary.
You create a list with any
which must calculate all values, use a generator instead.
You are using get
so you can provide a default value.
So i get this:
import math
def get_euclidean_dist(data,user1,user2):
c1 = data[user1]
c2 = data[user2]
#Ignore users with no overlapping items
if any(x in c1 for x in c2):
items = set(c1)
items.update(c2)
rs = sum((c1.get(i, 0)-c2.get(i, 0))**2 for i in items)
return math.sqrt(rs)
else:
return None
Upvotes: 2