Andy

Reputation: 29

Pairwise similarity

I have a pandas DataFrame that looks like this:

df = pd.DataFrame({'name': [0, 1, 2, 3], 'cards': [['A', 'B', 'C', 'D'],
                                                   ['B', 'C', 'D', 'E'],
                                                   ['E', 'F', 'G', 'H'],
                                                   ['A', 'A', 'E', 'F']]})

name    cards
0       ['A', 'B', 'C', 'D']
1       ['B', 'C', 'D', 'E']
2       ['E', 'F', 'G', 'H']
3       ['A', 'A', 'E', 'F']

And I'd like to create a matrix that looks like this:

    name  0    1    2    3
name
0         4    3    0    1
1         3    4    1    1
2         0    1    4    2
3         1    1    2    4

where each value is the number of items the two rows have in common.

Any ideas?

Upvotes: 2

Views: 137

Answers (3)

Mehrdad Dowlatabadi

Reputation: 1335

A nested list comprehension that iterates over all pairs of rows produces the result:

import pandas as pd
df = pd.DataFrame({'name': [0, 1, 2, 3], 'cards': [['A', 'B', 'C', 'D'],
                                               ['B', 'C', 'D', 'E'],
                                               ['E', 'F', 'G', 'H'],
                                               ['A', 'A', 'E', 'F']]})
# Count the distinct cards shared by every pair of rows.
result = [[len(set(x) & set(y)) for x in df['cards']] for y in df['cards']]

print(result)

Output:

[[4, 3, 0, 1], [3, 4, 1, 1], [0, 1, 4, 2], [1, 1, 2, 3]]

Here '&' computes the intersection of the two sets.
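As a small illustration of what the set intersection does here, the sketch below uses two rows from the question plus the row with the repeated 'A'. The variable names are made up for the example.

```python
# Set intersection keeps only the distinct items present in both lists.
a = ['A', 'B', 'C', 'D']
b = ['B', 'C', 'D', 'E']
common = set(a) & set(b)      # equivalent to set(a).intersection(b)
print(len(common))            # 3

# Duplicates collapse inside a set, so a list with a repeated item
# overlaps with itself in fewer items than its length.
d = ['A', 'A', 'E', 'F']
self_overlap = len(set(d) & set(d))
print(self_overlap)           # 3, not 4
```

This is why the last cell of the output above is 3 rather than 4.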

To reproduce the matrix from the question exactly, including the 4 in the last cell:

import pandas as pd
df = pd.DataFrame({'name': [0, 1, 2, 3], 'cards': [['A', 'B', 'C', 'D'],
                                                    ['B', 'C', 'D', 'E'],
                                                    ['E', 'F', 'G', 'H'],
                                                    ['A', 'A', 'E', 'F']]})
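# Row length minus the larger of the two set differences; unlike a plain
# intersection, this gives 4 on the diagonal for the row with a repeated 'A'.
result = [[len(x) - max(len(set(y) - set(x)), len(set(x) - set(y))) for x in df['cards']]
          for y in df['cards']]

print(result)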

Output:

[[4, 3, 0, 1], [3, 4, 1, 1], [0, 1, 4, 2], [1, 1, 2, 4]]
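Another way to get a 4 for the row with the repeated 'A' is to treat each list as a multiset. The sketch below is an alternative, not the method used above: it relies on collections.Counter from the standard library, whose & operator keeps the minimum count of each item. The helper name multiset_overlap is hypothetical, and the card lists are copied from the question.

```python
from collections import Counter

cards = [['A', 'B', 'C', 'D'],
         ['B', 'C', 'D', 'E'],
         ['E', 'F', 'G', 'H'],
         ['A', 'A', 'E', 'F']]

def multiset_overlap(x, y):
    """Count shared items, keeping duplicates (minimum of the two counts)."""
    return sum((Counter(x) & Counter(y)).values())

counts = [[multiset_overlap(x, y) for x in cards] for y in cards]
print(counts)
# [[4, 3, 0, 1], [3, 4, 1, 1], [0, 1, 4, 2], [1, 1, 2, 4]]
```

Wrapping counts in pd.DataFrame(counts, index=df['name'], columns=df['name']) should give the labelled matrix from the question.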

Upvotes: 1

Atendra Gautam

Reputation: 475

import pandas as pd
import numpy as np


df = pd.DataFrame([['A', 'B', 'C', 'D'],
                   ['B', 'C', 'D', 'E'],
                   ['E', 'F', 'G', 'H'],
                   ['A', 'A', 'E', 'F']])


nrows = df.shape[0]
# Initialization
matrix = np.zeros((nrows,nrows),dtype= np.int64)


for i in range(nrows):
    for j in range(nrows):
        # Number of distinct items shared by rows i and j
        matrix[i, j] = len(set(df.iloc[i]) & set(df.iloc[j]))

print(matrix)

Output:

[[4 3 0 1]
 [3 4 1 1]
 [0 1 4 2]
 [1 1 2 3]]

Upvotes: 0

Ananay Mital

Reputation: 1475

Using .apply with a lambda, we can get a DataFrame directly:

def func(df, j):
    return pd.Series([len(set(i)&set(j)) for i in df.cards])

newdf = df.cards.apply(lambda x: func(df, x))
newdf

    0   1   2   3
0   4   3   0   1
1   3   4   1   1
2   0   1   4   2
3   1   1   2   3
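For larger frames, the same set-based counts can be sketched without a Python-level loop over pairs. This is an alternative to the apply approach above, not a replacement for it, and it assumes pandas 0.25 or later for DataFrame.explode. The names exploded, indicator and similarity are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'name': [0, 1, 2, 3],
                   'cards': [['A', 'B', 'C', 'D'],
                             ['B', 'C', 'D', 'E'],
                             ['E', 'F', 'G', 'H'],
                             ['A', 'A', 'E', 'F']]})

# One row per (name, card) pair; reset the index so labels are unique.
exploded = df.explode('cards').reset_index(drop=True)

# 0/1 indicator table: rows are names, columns are distinct cards.
# clip(upper=1) collapses repeated cards such as the two 'A's in row 3.
indicator = pd.crosstab(exploded['name'], exploded['cards']).clip(upper=1)

# The dot product of two indicator rows counts their distinct shared cards.
similarity = indicator.dot(indicator.T)
print(similarity)
```

The result is indexed by name on both axes, and like the set intersection it counts the repeated 'A' once, so the last diagonal cell is 3.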

Upvotes: 1
