Reputation: 305
Let's say that I have a dataset that contains 4 binary columns for 2 rows.
It looks like this:
c1 c2 c3 c4 c5
r1 0 1 0 1 0
r2 1 1 1 1 0
I want to create a matrix that gives the number of occurrences of a column, given that it also occurred in another column. Kinda like a confusion matrix
My desired output is:
c1 c2 c3 c4 c5
c1 - 1 1 1 0
c2 1 - 1 2 0
c3 1 1 - 1 0
c4 1 2 1 - 0
I have used pandas crosstab but it only gives the desired output when using 2 columns. I want to use all of the columns
Upvotes: 3
Views: 251
Reputation: 323236
A way of using melt
and merge
with groupby
s=df.reset_index().melt('index').loc[lambda x : x.value==1]
s.merge(s,on='index').query('variable_x!=variable_y').groupby(['variable_x','variable_y'])['value_x'].sum().unstack(fill_value=0)
Out[32]:
variable_y c1 c2 c3 c4
variable_x
c1 0 1 1 1
c2 1 0 1 2
c3 1 1 0 1
c4 1 2 1 0
Upvotes: 3
Reputation: 294258
dot
df.T.dot(df)
# same as
# df.T @ df
c1 c2 c3 c4 c5
c1 1 1 1 1 0
c2 1 2 1 2 0
c3 1 1 1 1 0
c4 1 2 1 2 0
c5 0 0 0 0 0
You can use np.fill_diagonal
to make the diagonal zero
d = df.T.dot(df)
np.fill_diagonal(d.to_numpy(), 0)
d
c1 c2 c3 c4 c5
c1 0 1 1 1 0
c2 1 0 1 2 0
c3 1 1 0 1 0
c4 1 2 1 0 0
c5 0 0 0 0 0
And as long as we're using Numpy, you could go all the way...
a = df.to_numpy()
b = a.T @ a
np.fill_diagonal(b, 0)
pd.DataFrame(b, df.columns, df.columns)
c1 c2 c3 c4 c5
c1 0 1 1 1 0
c2 1 0 1 2 0
c3 1 1 0 1 0
c4 1 2 1 0 0
c5 0 0 0 0 0
Upvotes: 7