Reputation: 1633
I have a DataFrame of authors and their papers:
author paper
0 A z
1 B z
2 C z
3 D y
4 E y
5 C y
6 F x
7 G x
8 G w
9 B w
I want to get a matrix of how many papers each pair of authors has together.
   A  B  C  D  E  F  G
A
B  1
C  1  1
D  1  0  1
E  0  0  1  1
F  0  0  0  0  0
G  0  1  0  0  0  1
Is there a way to transform the DataFrame using pandas to get this result? Or is there a more efficient way (e.g. with numpy) to do this so that it scales?
Upvotes: 1
Views: 152
Reputation: 353179
get_dummies, which I first reached for, isn't as convenient here as I'd hoped; it needed an extra groupby. Instead, it's actually simpler to add a dummy column or use a custom aggfunc. For example, if we start from a df like this (note that I've added an extra paper a so that there's at least one pair who's written more than one paper together):
>>> df
author paper
0 A z
1 B z
2 C z
[...]
10 A a
11 B a
We can add a dummy tick column, pivot, and then use the "it's simply a dot product" observation from this question:
>>> df["dummy"] = 1
>>> dm = df.pivot(index="author", columns="paper", values="dummy").fillna(0).astype(int)
>>> dout = dm.dot(dm.T)
>>> dout
author A B C D E F G
author
A 2 2 1 0 0 0 0
B 2 3 1 0 0 0 1
C 1 1 2 1 1 0 0
D 0 0 1 1 1 0 0
E 0 0 1 1 1 0 0
F 0 0 0 0 0 1 1
G 0 1 0 0 0 1 2
where the diagonal counts how many papers an author has written. If you really want to obliterate the diagonal and above, we can do that too:
>>> import numpy as np
>>> dout = dout.where(np.tril(np.ones(dout.shape, dtype=bool), k=-1), 0)
>>> dout
author A B C D E F G
author
A 0 0 0 0 0 0 0
B 2 0 0 0 0 0 0
C 1 1 0 0 0 0 0
D 0 0 1 0 0 0 0
E 0 0 1 1 0 0 0
F 0 0 0 0 0 0 0
G 0 1 0 0 0 1 0
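For anyone who wants to run the whole thing end to end, here is a minimal self-contained sketch of the same dot-product idea. Note that it swaps the dummy column for pd.crosstab to build the author-by-paper matrix; that substitution, and the variable names incidence, counts and pairs, are my own choices for illustration. It also assumes each (author, paper) row appears only once; duplicate rows would inflate the counts.

```python
import numpy as np
import pandas as pd

# Example data: the question's rows plus the extra paper "a" by A and B.
df = pd.DataFrame({
    "author": list("ABCDECFGGBAB"),
    "paper":  list("zzzyyyxxwwaa"),
})

# Author-by-paper matrix: how many times each author appears on each paper
# (0 or 1 when there are no duplicate rows). No dummy column needed.
incidence = pd.crosstab(df["author"], df["paper"])

# Entry (i, j) sums, over all papers, the product of the two authors'
# indicators, which is the number of papers they share.
counts = incidence.dot(incidence.T)

# Keep only the strict lower triangle so each pair is counted once
# and the diagonal (papers per author) is zeroed out.
lower = np.tril(np.ones(counts.shape, dtype=bool), k=-1)
pairs = counts.where(lower, 0)

print(counts)
print(pairs)
```

Building a new frame with where avoids writing into dout.values, which may not propagate back to the DataFrame in every pandas version.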
Upvotes: 1