toothsie

Reputation: 255

Can I use cosine similarity between rows using only non null values?

I want to find the cosine similarity (or Euclidean distance, if that is easier) between one query row and 10 other rows. These rows are full of NaN values, so any column that is NaN should be ignored.

For example, query :

A   B   C   D   E   F
3   2  NaN  5  NaN  4

df =

A   B   C   D   E   F
2   1   3  NaN  4   5
1  NaN  2   4  NaN  3
.   .   .   .   .   .
.   .   .   .   .   .

So I just want the cosine similarity over the non-null columns that the query and each row of df have in common. For row 0 in df, for example, A, B, and F are non-null in both the query and df.

I then want to print the cosine similarity for each row.

Thanks in advance

Upvotes: 5

Views: 4102

Answers (2)

For Euclidean distance, see https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.nan_euclidean_distances.html. It ignores NaNs in its calculations.
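As a rough sketch of how that function could be applied to the example from the question (the variable names `query`, `df`, and `distances` are just chosen here for illustration), something like the following should work. Note that `nan_euclidean_distances` does not return the raw distance over the shared columns: it scales the squared distance by (total columns / columns present in both rows) before taking the square root.

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import nan_euclidean_distances

cols = list("ABCDEF")

# Example data copied from the question
query = pd.DataFrame([[3, 2, np.nan, 5, np.nan, 4]], columns=cols)
df = pd.DataFrame(
    [[2, 1, 3, np.nan, 4, 5],
     [1, np.nan, 2, 4, np.nan, 3]],
    columns=cols,
)

# Result has shape (1, n_rows); take the first row to get one distance per df row
distances = nan_euclidean_distances(query.to_numpy(), df.to_numpy())[0]
print(distances)
```

Because of the rescaling, rows that share few columns with the query are not automatically favoured just for having fewer terms in the sum.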

For cosine similarity, you cannot simply fillna as this will change your similarity score. Instead, take subsets of your df and calculate the cosine similarity across columns that do not contain null values.

For your example dataframe, this would calculate cosine similarity across all rows using just columns A and F, across the query and row 1 using A, B, and F, and across the query and row 2 using A, D, and F. You would then need some way of ranking or choosing among the resulting scores.

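from scipy.spatial.distance import pdist

# Assumption: df holds the query row together with the rows to compare.
# Collect the non-null column labels of each row.
combinations = [tuple(row.dropna().index) for _, row in df.iterrows()]

# remove duplicate null combinations
combinations = [list(item) for item in set(combinations)]

for cols in combinations:
    # pdist returns cosine *distance*; similarity is 1 - distance
    print(cols, 1 - pdist(df[cols].dropna(), metric='cosine'))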
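If all you need is one score per row against the query, a more direct option is to mask each pair down to the columns where both are non-null. This is only a minimal sketch: `masked_cosine` is a hypothetical helper written for this post, not a library function, and it returns NaN when a pair has no usable columns.

```python
import numpy as np
import pandas as pd

def masked_cosine(u, v):
    """Cosine similarity over the positions where both u and v are non-NaN."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    shared = ~np.isnan(u) & ~np.isnan(v)
    if not shared.any():
        return np.nan  # no columns in common
    u, v = u[shared], v[shared]
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else np.nan

cols = list("ABCDEF")

# Example data copied from the question
query = pd.Series([3, 2, np.nan, 5, np.nan, 4], index=cols)
df = pd.DataFrame(
    [[2, 1, 3, np.nan, 4, 5],
     [1, np.nan, 2, 4, np.nan, 3]],
    columns=cols,
)

# One similarity per row of df, each computed on that row's shared columns
similarities = df.apply(lambda row: masked_cosine(query, row), axis=1)
print(similarities)
```

Keep in mind that each score is computed over a different set of columns, so scores from rows with very few shared columns are less reliable to compare.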

Upvotes: 4

cs95

Reputation: 402942

The simplest method I can think of is to use sklearn's cosine_similarity.

from sklearn.metrics.pairwise import cosine_similarity
# df1 is the one-row query DataFrame; df holds the rows to compare against
cosine_similarity(df.fillna(0), df1.fillna(0))
# array([[0.51378309],
#        [0.86958199]])

The easiest way to "ignore" NaNs is to just treat them as zeros when computing similarity.
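To see what zero-filling does to the score, here is a small NumPy-only sketch using the query and row 0 from the question (all variable names are made up for illustration). The dot product is the same either way, since a zero contributes nothing to it, but the zero-filled norms still include values from columns the other row is missing, so the resulting similarity comes out lower.

```python
import numpy as np

# Query and row 0 from the question
q = np.array([3, 2, np.nan, 5, np.nan, 4])
r = np.array([2, 1, 3, np.nan, 4, 5])

# Zero-filled versions and the mask of columns present in both
q0, r0 = np.nan_to_num(q), np.nan_to_num(r)
shared = ~np.isnan(q) & ~np.isnan(r)

# Numerators agree: zeros drop out of the dot product
dot_filled = q0 @ r0
dot_shared = q[shared] @ r[shared]

# Denominators differ: zero-filled norms use every non-null value in each row
sim_filled = dot_filled / (np.linalg.norm(q0) * np.linalg.norm(r0))
sim_shared = dot_shared / (np.linalg.norm(q[shared]) * np.linalg.norm(r[shared]))

print(sim_filled, sim_shared)
```

Whether that difference matters depends on whether a missing value should count as "no signal" (zero-fill) or be left out of the comparison entirely (masking).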

Upvotes: 0
