Guido Muscioni

Reputation: 1295

Python - speed up cosine similarity with scipy

The following question arises from a previous one that I asked: Python - How to speed up cosine similarity with counting arrays

I am facing a serious performance issue with the proposed solution: my implementation takes a very long time to build the cosine similarity matrix. Below is the code I am using:

import numpy as np
import pandas as pd
import networkx as nx
from scipy import spatial

def compute_other(user_1, user_2):
    # Union of the artists present in either user's listening history
    uniq = list(set(user_1[0] + user_2[0]))

    duniq = {k: 0 for k in uniq}

    # Count vectors built over the same artist order
    u1 = create_vector(duniq, list(user_1[0]))
    u2 = create_vector(duniq, list(user_2[0]))

    # Cosine similarity (1 - cosine distance)
    return 1 - spatial.distance.cosine(u1, u2)

# START
# Note: compute_other returns a *similarity* (1 - cosine distance),
# so this matrix actually holds similarities despite its name.
distances = spatial.distance.cdist(df[['ARTIST']], df[['ARTIST']], metric=compute_other)

# Zero the diagonal and upper triangle so each pair is kept only once
idx_to_remove = np.triu_indices(len(distances))
distances[idx_to_remove] = 0

df_dist = pd.DataFrame(distances, index=df.index, columns=df.index)
edges = df_dist.stack().to_dict()
edges = {k: v for k, v in edges.items() if v > 0}

print('NET inference')
net = nx.Graph()
net.add_nodes_from(df.index)
net.add_edges_from(edges)

The first thing I notice is that I compute the complete matrix and then remove half of it, so it would be cool to compute only the half that I actually need (a 2x saving); see the sketch below.
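For example, scipy's pdist evaluates each unordered pair exactly once, which is precisely that saving. A minimal sketch, assuming pdist accepts the same object-dtype input that cdist does above:

from scipy.spatial.distance import pdist, squareform

# pdist computes each pair (i, j) with i < j once: no diagonal and
# no mirrored upper half.
condensed = pdist(df[['ARTIST']], metric=compute_other)

# Expand back to a square matrix only if the rest of the pipeline
# still needs one; the result is symmetric, so the np.triu_indices
# step above would still apply before building the edges.
distances = squareform(condensed)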

This is the structure of df:

ARTIST
"(75751, 75751, 75751, 75751, 75751, 75751, 75751, 75751, 75751, 75751, 75751, 75751, 75751, 75751, 15053)"
"(55852, 55852, 17727, 17727, 2182)"
"(11446, 11446, 11446, 11446, 11446, 11446, 11446, 11446)"
"(54795,)"
"(22873, 22873, 22873, 22873)"
"(5634, 5634)"
"(311, 18672)"
"(1740, 1740, 1740, 1740, 1746, 15048, 15048, 1740)"
"(1788, 1983, 1788, 1748, 723, 100744, 723, 226, 1583, 12188, 51325, 1748, 75401, 1171)"
"(59173, 59173)"
"(2673, 2673, 2673, 2673, 2673, 2673, 2673, 5634, 5634, 5634)"
"(2251, 4229, 14207, 1744, 16366, 1218)"
"(19703, 1171, 1171)"
"(12877,)"
"(1243, 8249, 2061, 1243, 13343, 9868, 574509, 892, 1080, 1243, 3868, 2061, 4655)"
"(1229,)"
"(3868, 60112, 11084)"
"(15869, 15869, 15869, 15869)"
"(4067, 4067, 4067, 4067, 4067, 4067)"
"(1171, 1171, 1171, 1171)"
"(1245, 1245, 1245, 1245, 1245, 1245, 1245, 1245, 1245, 1195, 1193, 1193, 1193, 1193, 1193, 1193)"
"(723, 723)"  

This dataset is complete and can be used with the code I posted. Just read it as a normal CSV with pandas and apply the conversion below:

import ast
import pandas as pd

df = pd.read_csv('Stack.csv')
df['ARTIST'] = df['ARTIST'].apply(lambda x : ast.literal_eval(x))

This code runs in almost 166. I am running 8 processes in parallel on my 8-core processor, and each process computes the same function over a different dataset. I honestly do not know if this is already the most optimized version; in any case, removing half of the computation as explained above would also be really useful (going from 166 to 83).
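One alternative, as a point of comparison, is to avoid the per-pair Python work entirely: build one sparse row of artist counts per user and obtain every pairwise similarity from a single matrix product. This is only a sketch, not the code timed above, and it assumes the raw artist IDs are acceptable as sparse column indices:

from collections import Counter

import numpy as np
from scipy import sparse

def cosine_matrix(artist_lists):
    # One row per user, one column per artist ID, entries = play counts
    rows, cols, vals = [], [], []
    for i, artists in enumerate(artist_lists):
        for artist, count in Counter(artists).items():
            rows.append(i)
            cols.append(artist)
            vals.append(count)
    m = sparse.csr_matrix((vals, (rows, cols)), dtype=np.float64)
    # L2-normalize every row; the normalized matrix times its own
    # transpose is then the full cosine-similarity matrix.
    norms = np.sqrt(m.multiply(m).sum(axis=1).A)  # (n, 1); nonzero since every user has at least one artist
    m = m.multiply(1.0 / norms).tocsr()
    return (m @ m.T).toarray()

# sims = cosine_matrix(df['ARTIST'])  # same values as the cdist matrix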

EDIT: Below is the create_vector function:

from collections import Counter

def create_vector(duniq, l):
    dx = duniq.copy()          # start from zero counts for every artist
    dx.update(Counter(l))      # overwrite with the actual counts
    return list(dx.values())   # counts in the shared key order
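The two vectors line up because both calls copy the same duniq, so they share a single artist-to-position mapping. A tiny illustration (the exact order depends on set iteration, but it is identical for both vectors):

user_1 = [(75751, 75751, 15053)]
user_2 = [(15053, 2182)]

uniq = list(set(user_1[0] + user_2[0]))  # some ordering of {75751, 15053, 2182}
duniq = {k: 0 for k in uniq}

u1 = create_vector(duniq, list(user_1[0]))  # e.g. [2, 1, 0]
u2 = create_vector(duniq, list(user_2[0]))  # e.g. [0, 1, 1], same positions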

Upvotes: 0

Views: 650

Answers (1)

Ethan

Reputation: 1373

I was trying to tinker with this; however, I am getting an error on these two lines:

u1 = create_vector(duniq, list(user_1[0]))
u2 = create_vector(duniq, list(user_2[0]))

Is create_vector() a def you built but did not post?

I suspect using a mask on your df would improve performance by removing the overwriting you are doing with distances[idx_to_remove] = 0, and it should reduce the number of iterations in edges = {k: v for k, v in edges.items() if v > 0}. A rough sketch of what I mean is below.
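Something along these lines (untested on your full data, and assuming the default integer index on df):

import numpy as np

# Pull the strictly-lower-triangle entries directly instead of zeroing
# the upper half and then filtering the zeros back out.
i, j = np.tril_indices(len(distances), k=-1)
vals = distances[i, j]
keep = vals > 0
edges = dict(zip(zip(i[keep], j[keep]), vals[keep]))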

If you can post where create_vector() is coming from, or the def itself, I'd like to test a mask. It's an interesting problem.

Hi @Guido. Apologies for taking so long, but this has been a tough nut to crack! After trying a few different things (which took even longer), I came up with the following to use in place of both your create_vector() and compute_other() functions:

def compute_other2(user_1, user_2):
    uniq = set(user_1[0] + user_2[0])  # unique artists across user_1 and user_2
    u1 = [user_1[0].count(ui) for ui in uniq]  # count vector for user_1
    u2 = [user_2[0].count(ui) for ui in uniq]  # count vector for user_2
    return 1 - spatial.distance.cosine(u1, u2)

I got a 20% performance improvement, less than I had hoped, but something. Note: I am still running your code with spatial.distance.cdist. I did see that you gained 50% by switching to spatial.distance.pdist; I'm not sure how you used it, and (what I suspect is vector math) is beyond my ken. Maybe you can use this new compute_other2() function with spatial.distance.pdist and gain a bit more; a sketch follows.
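For example (untested, building on your own pdist experiment):

from itertools import combinations
from scipy.spatial.distance import pdist

# compute_other2 returns a similarity, and pdist orders its output like
# combinations(range(n), 2), so the edges can be built without ever
# materializing the square matrix.
sims = pdist(df[['ARTIST']], metric=compute_other2)
edges = {pair: s for pair, s in zip(combinations(range(len(df)), 2), sims) if s > 0}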

P.S. If you try this, please verify the results. I checked mine against your original code and they appear correct to me.

Upvotes: 1
