ImAUser

Reputation: 129

How do I create a co-occurrence matrix in Python?

I have a dataframe with N columns. Each element in the dataframe is an integer in the range [0, N-1].

For example, my dataframe can be something like (N=3):

    A   B   C
0   0   2   0
1   1   0   1
2   2   2   0
3   2   0   0
4   0   0   0

I want to create a co-occurrence matrix (please correct me if there is a different standard name for it) of size N x N, in which each element (i, j) contains the number of rows where columns i and j have the same value.

    A   B   C
A   x   2   3
B   2   x   2
C   3   2   x

Where, for example, matrix[0, 1] means that A and B assume the same value 2 times. I don't care about the value on the diagonal.

What is the smartest way to do that?

Upvotes: 2

Views: 211

Answers (3)

Shubham Sharma

Reputation: 71689

DataFrame.corr

We can pass a custom callable to DataFrame.corr as the method used to compare each pair of columns. The callable takes two 1D NumPy arrays as input and returns the number of positions at which the two arrays are equal. Note that the diagonal in the result is 1.0 rather than the true match count, which does not matter here because the diagonal is ignored.

df.corr(method=lambda x, y: (x==y).sum())

     A    B    C
A  1.0  2.0  3.0
B  2.0  1.0  2.0
C  3.0  2.0  1.0
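
For completeness, here is a minimal self-contained sketch of the same idea using the dataframe from the question. The helper name count_equal is only an illustrative choice, not part of pandas.

```python
import pandas as pd

# Example dataframe from the question (N = 3)
df = pd.DataFrame({
    "A": [0, 1, 2, 2, 0],
    "B": [2, 0, 2, 0, 0],
    "C": [0, 1, 0, 0, 0],
})

def count_equal(x, y):
    """Count positions where two 1-D arrays hold the same value."""
    return (x == y).sum()

# corr calls count_equal once per pair of distinct columns
# and returns a float-valued, symmetric dataframe.
cooc = df.corr(method=count_equal)
print(cooc)
```

The result holds floats, so cast the off-diagonal entries to integers yourself if you need integer counts.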

Upvotes: 2

Henry Ecker

Reputation: 35626

Let's try broadcasting the transposed values against themselves and summing over axis 2:

import pandas as pd

df = pd.DataFrame({
    'A': {0: 0, 1: 1, 2: 2, 3: 2, 4: 0},
    'B': {0: 2, 1: 0, 2: 2, 3: 0, 4: 0},
    'C': {0: 0, 1: 1, 2: 0, 3: 0, 4: 0}
})

vals = df.T.values
e = (vals[:, None] == vals).sum(axis=2)

print(e)

e:

[[5 2 3]
 [2 5 2]
 [3 2 5]]

Turn back into a dataframe:

new_df = pd.DataFrame(e, columns=df.columns, index=df.columns)

new_df:

   A  B  C
A  5  2  3
B  2  5  2
C  3  2  5
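
To make the broadcasting step easier to follow, here is a rough sketch that spells out the intermediate shapes and then blanks the diagonal with NaN, since the question does not use it. The variable names (left, eq, counts, masked) are just illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [0, 1, 2, 2, 0],
    "B": [2, 0, 2, 0, 0],
    "C": [0, 1, 0, 0, 0],
})

vals = df.T.to_numpy()      # shape (3, 5): one row per original column
left = vals[:, None]        # shape (3, 1, 5): extra axis for broadcasting
eq = left == vals           # shape (3, 3, 5): eq[i, j, k] is True when
                            # columns i and j agree in row k
counts = eq.sum(axis=2)     # shape (3, 3): number of agreeing rows per pair

# Optional: hide the diagonal (needs a float array to hold NaN)
counts_f = counts.astype(float)
np.fill_diagonal(counts_f, np.nan)
masked = pd.DataFrame(counts_f, index=df.columns, columns=df.columns)
print(masked)
```

The intermediate boolean array has N x N x (number of rows) elements, so memory use grows quickly for wide dataframes.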

Upvotes: 1

aaronn

Reputation: 478

I don't know about the smartest way but I think this works:

import numpy as np

m = np.array([[0, 2, 0], [1, 0, 1], [2, 2, 0], [2, 0, 0], [0, 0, 0]])
n = 3

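ans = np.zeros((n, n))
# Fill only the upper triangle; the matrix is symmetric
for i in range(n):
    for j in range(i+1, n):
        # Rows where the difference is zero are rows where columns i and j match
        ans[i, j] = len(m) - np.count_nonzero(m[:, i] - m[:, j])

# Mirror the upper triangle into the lower one (the diagonal stays 0)
print(ans + ans.T)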
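
The same pairwise loop can also be applied directly to a dataframe so that the result keeps the column labels. This is only a sketch; pairwise_matches is a hypothetical helper name, and it compares values with == instead of subtracting them.

```python
from itertools import combinations

import pandas as pd

def pairwise_matches(df):
    """Count, for each pair of columns, the rows where both hold the same value."""
    cols = list(df.columns)
    out = pd.DataFrame(0, index=cols, columns=cols, dtype=int)
    # Visit each unordered pair once and fill both symmetric cells
    for a, b in combinations(cols, 2):
        n = int((df[a].to_numpy() == df[b].to_numpy()).sum())
        out.loc[a, b] = n
        out.loc[b, a] = n
    return out  # the diagonal is left at 0

df = pd.DataFrame({
    "A": [0, 1, 2, 2, 0],
    "B": [2, 0, 2, 0, 0],
    "C": [0, 1, 0, 0, 0],
})
print(pairwise_matches(df))
```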

Upvotes: 0
