Amyunimus

Reputation: 1633

Transform dataframe to get co-author relationships

I have a DataFrame of authors and their papers:

     author paper
0      A     z
1      B     z
2      C     z
3      D     y
4      E     y
5      C     y
6      F     x
7      G     x
8      G     w
9      B     w

I want to get a matrix of how many papers each pair of authors has together.

   A B C D E F G
A   
B  1  
C  1 1  
D  0 0 1  
E  0 0 1 1 
F  0 0 0 0 0 
G  0 1 0 0 0 1

Is there a way to transform the DataFrame using pandas to get this result? Or is there a more efficient way (for example with numpy) to do this so that it scales?

Upvotes: 1

Views: 152

Answers (1)

DSM

Reputation: 353179

I first reached for get_dummies, but it isn't as convenient here as I'd hoped, because it needs an extra groupby. It's actually simpler to add a dummy column or use a custom aggfunc. For example, suppose we start from a df like this (note that I've added an extra paper a so that at least one pair has written more than one paper together):

>>> df
   author paper
0       A     z
1       B     z
2       C     z
[...]
10      A     a
11      B     a
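
To reproduce the session, the frame can be rebuilt along these lines. This is a sketch that assumes the rows elided as [...] above are rows 3-9 of the question, unchanged; the helper names authors and papers are only for illustration.

```python
import pandas as pd

# Rows 0-9 are the question's data; rows 10-11 are the extra paper "a".
# Assumption: the rows shown as [...] match the question exactly.
authors = list("ABCDECFGGBAB")
papers = list("zzzyyyxxwwaa")

df = pd.DataFrame({"author": authors, "paper": papers})
```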

We can add a dummy tick column, pivot, and then use the "it's simply a dot product" observation from this question:

>>> df["dummy"] = 1
>>> dm = df.pivot(index="author", columns="paper").fillna(0)
>>> dout = dm.dot(dm.T)
>>> dout
author  A  B  C  D  E  F  G
author                     
A       2  2  1  0  0  0  0
B       2  3  1  0  0  0  1
C       1  1  2  1  1  0  0
D       0  0  1  1  1  0  0
E       0  0  1  1  1  0  0
F       0  0  0  0  0  1  1
G       0  1  0  0  0  1  2

where the diagonal counts how many papers an author has written. If you really want to obliterate the diagonal and above, we can do that too:

>>> import numpy as np
>>> dout.values[np.triu_indices_from(dout)] = 0
>>> dout
author  A  B  C  D  E  F  G
author                     
A       0  0  0  0  0  0  0
B       2  0  0  0  0  0  0
C       1  1  0  0  0  0  0
D       0  0  1  0  0  0  0
E       0  0  1  1  0  0  0
F       0  0  0  0  0  0  0
G       0  1  0  0  0  1  0
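
As a variation on the same idea, here is a minimal end-to-end sketch that needs neither the dummy column nor a write into .values. It swaps in pd.crosstab to build the author-by-paper indicator table and DataFrame.where with a np.tril mask to keep the strictly lower triangle. It assumes the sample data is the question's ten rows plus the extra paper a; the names keep and lower are just illustrative. Writing through .values may not work on pandas versions with copy-on-write enabled, which is the main reason to consider the mask.

```python
import numpy as np
import pandas as pd

# Assumed sample data: the question's rows plus the extra paper "a".
df = pd.DataFrame({
    "author": list("ABCDECFGGBAB"),
    "paper": list("zzzyyyxxwwaa"),
})

# crosstab counts (author, paper) occurrences directly, so no dummy
# column or fillna is needed; every cell is 0 or 1 for this data.
dm = pd.crosstab(df["author"], df["paper"])

# Entry (i, j) is the number of papers authors i and j share;
# the diagonal is each author's own paper count.
dout = dm.dot(dm.T)

# Boolean mask that is True only strictly below the diagonal.
keep = np.tril(np.ones(dout.shape, dtype=bool), k=-1)

# where() keeps masked-in cells and replaces the rest with 0,
# returning a new frame instead of mutating dout.
lower = dout.where(keep, 0)
```

If the same author could be listed twice on one paper, crosstab would count that cell as 2, so deduplicating with df.drop_duplicates() first is probably the safer choice.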

Upvotes: 1
