zabop
zabop

Reputation: 7922

Cooccurence table of values of two columns with non-overlapping set of entries

import pandas as pd

Reproducible setup

I have a dataframe:

df=\
pd.DataFrame.from_dict({'A':['a','b','b','c','d','d','c','b'],
                        'B':['p','q','p','q','r','r','p','q']})

ie

   A  B
0  a  p
1  b  q
2  b  p
3  c  q
4  d  r
5  d  r
6  c  p
7  b  q

If a value appears in column B, it surely does not appear in column A and vice-versa. The post Cooccurence matrix from pandas dataframe discusses a similar setup, but there are important differences: the form of both input and output there is different (though similar information content). It also does not require the columns to have distinct set of entries, unlike my current post.


Aim

I want to create a new dataframe, which shows how many times each entry from A appeared together with each entry from A. I would like the column of this new dataframe to be entries from df's column A, so the result would look like this:

res=\
pd.DataFrame.from_dict({'B':['p','q','r'],
                        'a':[1,0,0],
                        'b':[1,2,0],
                        'c':[1,1,0],
                        'd':[0,0,2]})

ie:

   B  a  b  c  d
0  p  1  1  1  0
1  q  0  2  1  0
2  r  0  0  0  2

Question

How can I transform df to the dataframe described above?


What I've tried

A)

Using this answer to the question Group by two columns and count the occurrences of each combination in Pandas, I came up with:

df.groupby(['A','B']).size().to_frame(name = 'size').reset_index()

ie

   A  B  size
0  a  p     1
1  b  p     1
2  b  q     2
3  c  p     1
4  c  q     1
5  d  r     2

Which is similar, but in a different form. Probably could transform it to my form but probably a more efficient method exists.


B)

Based on this, tried:

df.pivot_table('A', aggfunc='count').fillna(0)

But this gives ValueError: No group keys passed!. I am unsure whether the linked post is worked where it was posted, so I abandoned this path.


C)

Based on this answer to the question Cooccurence matrix from pandas dataframe, I came up with this:

s=df.stack().str.get_dummies().sum(level=0).ne(0).astype(int)
s=s.T.dot(s).astype(float)
np.fill_diagonal(s.values, np.nan)
s=s[['a','b','c','d']].drop(index=['a','b','c','d'])
s.index.name = 'B'
s=s.astype(int)
s.reset_index(inplace=True)

ie

   B  a  b  c  d
0  p  1  1  1  0
1  q  0  2  1  0
2  r  0  0  0  2

Which is what I want. It is rather convoluted and long solution however. I would prefer a more compact one.

Upvotes: 3

Views: 198

Answers (1)

Dani Mesejo
Dani Mesejo

Reputation: 61930

Use crosstab:

import pandas as pd

df = pd.DataFrame.from_dict(
    {
        "A": ["a", "b", "b", "c", "d", "d", "c", "b"],
        "B": ["p", "q", "p", "q", "r", "r", "p", "q"],
    }
)

res = pd.crosstab(df['B'], df['A'])
print(res)

Output

A  a  b  c  d
B            
p  1  1  1  0
q  0  2  1  0
r  0  0  0  2

Upvotes: 6

Related Questions