Alex

Reputation: 603

PYSPARK: How to find cosine similarity of two columns in a pyspark dataframe?

How can I find cosine similarity between two columns in a pyspark dataframe?

Suppose I have a spark dataframe

+--+--+
|a |b |
+--+--+
|1 |4 |
|2 |5 |
|3 |6 |
+--+--+

I want to compute the cosine similarity between the values in column a and the values in column b, treating each column as a vector, i.e.,

cosine_similarity([1, 2, 3], [4, 5, 6]) 

Upvotes: 2

Views: 6645

Answers (1)

titipata

Reputation: 5389

I assume that you want to find the similarity between two columns, treating each column as a vector. Say you have this dataframe:

import pandas as pd

# assumes an active SparkSession named `spark`
df = spark.createDataFrame(pd.DataFrame([[1,2], [3,4]], columns=['a', 'b']))

Define a simple function that takes the dataframe and two column names:

import pyspark.sql.functions as func

def cosine_similarity(df, col1, col2):
    # Aggregate the dot product and both Euclidean norms in a single pass.
    df_cosine = df.select(func.sum(df[col1] * df[col2]).alias('dot'),
                          func.sqrt(func.sum(df[col1]**2)).alias('norm1'),
                          func.sqrt(func.sum(df[col2]**2)).alias('norm2'))
    # The aggregation yields exactly one row; bring it to the driver as a dict.
    d = df_cosine.first().asDict()
    return d['dot']/(d['norm1'] * d['norm2'])

cosine_similarity(df, 'a', 'b') # output 0.989949
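To sanity-check the Spark result on small data, the same formula (dot product divided by the product of the two norms) can be computed on plain Python lists. This is only a minimal sketch: `cosine_similarity_local` is a hypothetical helper name, not part of PySpark, and returning None for a zero vector is my own assumption about how to handle the undefined case.

```python
import math

def cosine_similarity_local(xs, ys):
    """Cosine similarity of two equal-length numeric sequences (hypothetical helper)."""
    if len(xs) != len(ys):
        raise ValueError("sequences must have the same length")
    dot = sum(x * y for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum(x * x for x in xs))
    norm_y = math.sqrt(sum(y * y for y in ys))
    if norm_x == 0 or norm_y == 0:
        # Assumption: treat the similarity as undefined for an all-zero vector.
        return None
    return dot / (norm_x * norm_y)

# Columns of the dataframe used in this answer: a = [1, 3], b = [2, 4]
print(round(cosine_similarity_local([1, 3], [2, 4]), 6))        # 0.989949
# Columns from the question: a = [1, 2, 3], b = [4, 5, 6]
print(round(cosine_similarity_local([1, 2, 3], [4, 5, 6]), 6))  # 0.974632
```

The first value matches the Spark output above. Note that the Spark function as written divides by the product of the norms without a guard, so a column consisting only of zeros would raise a ZeroDivisionError there.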

Upvotes: 8
