Reputation: 12679
I am trying to create a co-occurrence matrix in Python and am looking for an efficient way to do it.
My dataset looks like this:
total_labels = ['a','b','c','d']
occ = [['a','b'],['c','d'],['a','c'],['d'],['a','c','d']]
And I expect output like this:
data_mat = [[0, 1, 2, 1],
            [1, 0, 0, 0],
            [2, 0, 0, 2],
            [1, 0, 2, 0]]
which is actually:
               a  b  c  d
data_mat = a [[0, 1, 2, 1],
           b  [1, 0, 0, 0],
           c  [2, 0, 0, 2],
           d  [1, 0, 2, 0]]
What I have tried is:
import numpy as np
m_matrix = np.zeros([4,4])
for m in range(len(total_labels)):
    for j in range(len(total_labels)):
        for k in occ:
            if set((total_labels[m],total_labels[j])).issubset(set(k)):
                m_matrix[m,j]+=1
which gives:
array([[3., 1., 2., 1.],
       [1., 1., 0., 0.],
       [2., 0., 3., 2.],
       [1., 0., 2., 3.]])
But as you can see, there should be no connection between (a,a), (b,b), etc. (self-loops), yet the diagonal has nonzero values.
How can I create data_mat without using so many loops?
Upvotes: 2
Views: 386
Reputation: 59579
Use a self-merge followed by crosstab:
import pandas as pd
df = pd.DataFrame(occ).stack().rename('val').reset_index().drop(columns='level_1')
df = df.merge(df, on='level_0').query('val_x != val_y')
pd.crosstab(df.val_x, df.val_y)
val_y  a  b  c  d
val_x
a      0  1  2  1
b      1  0  0  0
c      2  0  0  2
d      1  0  2  0
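As a cross-check on the table above, the same counts can probably be reproduced with plain NumPy. This is a different technique from the self-merge, not part of it: build a 0/1 indicator matrix (one row per list, one column per label), multiply its transpose by itself, and zero the diagonal. The names index, indicator and co_occurrence are made up for this sketch.

```python
import numpy as np

total_labels = ['a', 'b', 'c', 'd']
occ = [['a', 'b'], ['c', 'd'], ['a', 'c'], ['d'], ['a', 'c', 'd']]

# Map each label to its column position (hypothetical helper name)
index = {label: i for i, label in enumerate(total_labels)}

# indicator[r, c] is 1 when list r contains label c
indicator = np.zeros((len(occ), len(total_labels)), dtype=int)
for row, labels in enumerate(occ):
    for label in set(labels):
        if label in index:  # labels outside total_labels are skipped
            indicator[row, index[label]] = 1

# Entry (i, j) counts the lists that contain both label i and label j
co_occurrence = indicator.T @ indicator

# The diagonal holds per-label totals (the self-loops), so clear it
np.fill_diagonal(co_occurrence, 0)
print(co_occurrence)
```

Before it is cleared, the diagonal holds how many lists contain each label, which is where the 3, 1, 3, 3 in the question's attempt come from.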
If you need only the labels you supplied, you can reindex:
(pd.crosstab(df.val_x, df.val_y)
.reindex(total_labels, axis=0).reindex(total_labels, axis=1))
Or filter before the merge (probably smarter):
df = df.loc[df.val.isin(total_labels)]
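To make the ordering concrete, here is a minimal end-to-end sketch with the filter applied before the merge. Two things in it are assumptions for illustration, not part of the answer above: the long-form frame is built with a list comprehension standing in for the stack step, and a made-up label 'z' is added to the data so the filter has something to remove. The fill_value=0 argument is also an addition, so that a supplied label that never co-occurs shows 0 rather than NaN.

```python
import pandas as pd

total_labels = ['a', 'b', 'c', 'd']
# 'z' is a hypothetical extra label, added only to show the filter at work
occ = [['a', 'b'], ['c', 'd'], ['a', 'c'], ['d'], ['a', 'c', 'd'], ['a', 'z']]

# Long form: one row per (list id, label); stands in for the stack step
df = pd.DataFrame(
    [(i, label) for i, labels in enumerate(occ) for label in labels],
    columns=['level_0', 'val'],
)

# Filter first so unwanted labels never enter the self-merge
df = df.loc[df.val.isin(total_labels)]

# Pair every label with every other label from the same list
pairs = df.merge(df, on='level_0').query('val_x != val_y')

# Count the pairs and force the supplied label order on both axes
result = (pd.crosstab(pairs.val_x, pairs.val_y)
            .reindex(index=total_labels, columns=total_labels, fill_value=0))
print(result)
```

Filtering early keeps the merged frame smaller, since the self-merge grows with the square of each list's length.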
Upvotes: 1