Reputation: 27
I have a PySpark DataFrame, df1, that looks like:
Customer1  Customer2  v_cust1  v_cust2
        1          2      0.9      0.1
        1          3      0.3      0.4
        1          4      0.2      0.9
        2          1      0.8      0.8
I want to take the cosine similarity of the two columns and end up with something like this:
Customer1  Customer2  v_cust1  v_cust2  cosine_sim
        1          2      0.9      0.1        0.1
        1          3      0.3      0.4        0.9
        1          4      0.2      0.9        0.15
        2          1      0.8      0.8        1
I have a Python function that receives a number or an array of numbers, like this:
import numpy as np

def cos_sim(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
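For reference, here is how the function behaves when called directly on small arrays (the example vectors are made up for illustration):

```python
import numpy as np

def cos_sim(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Parallel vectors give a similarity of ~1, orthogonal vectors give 0
print(cos_sim([1.0, 1.0], [2.0, 2.0]))  # ~1.0
print(cos_sim([1.0, 0.0], [0.0, 1.0]))  # 0.0
```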
How can I create the cosine_sim column in my dataframe using a UDF? Can I pass several columns to the UDF instead of just one?
Upvotes: 1
Views: 667
Reputation: 7399
It would be more efficient to use a pandas_udf here.
It performs much better on vectorized operations than plain Spark UDFs: Introducing Pandas UDF for PySpark
import numpy as np
import pyspark.sql.functions as F
from pyspark.sql.functions import PandasUDFType, pandas_udf

# Names of the input columns and the output column
a, b = "v_cust1", "v_cust2"
cosine_sim_col = "cosine_sim"

# Pre-create the output column, since a GROUPED_MAP pandas_udf requires
# the input schema and the output schema to be the same.
df = df.withColumn(cosine_sim_col, F.lit(1.0).cast("double"))

@pandas_udf(df.schema, PandasUDFType.GROUPED_MAP)
def cos_sim(df):
    df[cosine_sim_col] = float(np.dot(df[a], df[b]) /
                               (np.linalg.norm(df[a]) * np.linalg.norm(df[b])))
    return df
# Assuming you want one cosine value per (Customer1, Customer2) group:
df2 = df.groupby(["Customer1", "Customer2"]).apply(cos_sim)

# If instead you want to compare the entire columns, add a constant
# column to all rows and group by it, e.g.:
df3 = df.withColumn("group", F.lit("group_a")).groupby("group").apply(cos_sim)
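To see what the grouped function computes without spinning up Spark, here is the same per-group arithmetic applied with plain pandas (a sketch assuming numpy and pandas are available; the sample values are taken from the question's first two rows):

```python
import numpy as np
import pandas as pd

def cos_sim_group(pdf, a="v_cust1", b="v_cust2", out="cosine_sim"):
    # Same arithmetic as the pandas_udf body: one cosine value per group,
    # broadcast to every row of that group.
    pdf[out] = float(np.dot(pdf[a], pdf[b]) /
                     (np.linalg.norm(pdf[a]) * np.linalg.norm(pdf[b])))
    return pdf

pdf = pd.DataFrame({"Customer1": [1, 1], "Customer2": [2, 2],
                    "v_cust1": [0.9, 0.3], "v_cust2": [0.1, 0.4]})
result = pdf.groupby(["Customer1", "Customer2"], group_keys=False).apply(cos_sim_group)
print(result["cosine_sim"].iloc[0])
```

Each group's vectors here are the column slices within that group, which is why all rows of a group share one cosine_sim value.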
Upvotes: 2