Reputation: 7922
import pandas as pd
I have a dataframe:
df=\
pd.DataFrame.from_dict({'A':['a','b','b','c','d','d','c','b'],
'B':['p','q','p','q','r','r','p','q']})
ie
A B
0 a p
1 b q
2 b p
3 c q
4 d r
5 d r
6 c p
7 b q
If a value appears in column B
, it surely does not appear in column A
and vice-versa. The post Cooccurence matrix from pandas dataframe discusses a similar setup, but there are important differences: the form of both input and output there is different (though similar information content). It also does not require the columns to have distinct set of entries, unlike my current post.
I want to create a new dataframe, which shows how many times each entry from A
appeared together with each entry from A
. I would like the column of this new dataframe to be entries from df
's column A
, so the result would look like this:
res=\
pd.DataFrame.from_dict({'B':['p','q','r'],
'a':[1,0,0],
'b':[1,2,0],
'c':[1,1,0],
'd':[0,0,2]})
ie:
B a b c d
0 p 1 1 1 0
1 q 0 2 1 0
2 r 0 0 0 2
How can I transform df
to the dataframe described above?
A)
Using this answer to the question Group by two columns and count the occurrences of each combination in Pandas, I came up with:
df.groupby(['A','B']).size().to_frame(name = 'size').reset_index()
ie
A B size
0 a p 1
1 b p 1
2 b q 2
3 c p 1
4 c q 1
5 d r 2
Which is similar, but in a different form. Probably could transform it to my form but probably a more efficient method exists.
B)
Based on this, tried:
df.pivot_table('A', aggfunc='count').fillna(0)
But this gives ValueError: No group keys passed!
. I am unsure whether the linked post is worked where it was posted, so I abandoned this path.
C)
Based on this answer to the question Cooccurence matrix from pandas dataframe, I came up with this:
s=df.stack().str.get_dummies().sum(level=0).ne(0).astype(int)
s=s.T.dot(s).astype(float)
np.fill_diagonal(s.values, np.nan)
s=s[['a','b','c','d']].drop(index=['a','b','c','d'])
s.index.name = 'B'
s=s.astype(int)
s.reset_index(inplace=True)
ie
B a b c d
0 p 1 1 1 0
1 q 0 2 1 0
2 r 0 0 0 2
Which is what I want. It is rather convoluted and long solution however. I would prefer a more compact one.
Upvotes: 3
Views: 198
Reputation: 61930
Use crosstab:
import pandas as pd
df = pd.DataFrame.from_dict(
{
"A": ["a", "b", "b", "c", "d", "d", "c", "b"],
"B": ["p", "q", "p", "q", "r", "r", "p", "q"],
}
)
res = pd.crosstab(df['B'], df['A'])
print(res)
Output
A a b c d
B
p 1 1 1 0
q 0 2 1 0
r 0 0 0 2
Upvotes: 6