Reputation: 2333
I have a dataframe df:
AID JID CID
0 1 A NaN
1 1 A NaN
2 1 B NaN
3 1 NaN X
4 3 A NaN
5 4 NaN NaN
6 4 C X
7 5 C Y
8 5 C X
9 6 A NaN
10 6 B NaN
I want to calculate how many times each AID has used each JID or CID value.
The resulting dataframe should look like this, with the AID values as the index and the JID and CID values as the columns:
A B C X Y
1 2 1 0 1 0
3 1 0 0 0 0
4 0 0 1 1 0
5 0 0 2 1 1
6 1 1 0 0 0
I know how to do this by looping and counting manually, but I was wondering what a more efficient way would be.
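For reference, the manual-loop baseline might look something like the sketch below (the DataFrame is reconstructed from the sample data in the question; the `Counter`-based bookkeeping is one possible way to do the counting by hand):

```python
import pandas as pd
import numpy as np
from collections import Counter

df = pd.DataFrame({
    "AID": [1, 1, 1, 1, 3, 4, 4, 5, 5, 6, 6],
    "JID": ["A", "A", "B", np.nan, "A", np.nan, "C", "C", "C", "A", "B"],
    "CID": [np.nan, np.nan, np.nan, "X", np.nan, np.nan, "X", "Y", "X",
            np.nan, np.nan],
})

# Count every non-NaN JID/CID value per AID, one row at a time.
counts = {}
for _, row in df.iterrows():
    c = counts.setdefault(row["AID"], Counter())
    for col in ("JID", "CID"):
        if pd.notna(row[col]):
            c[row[col]] += 1
```

This works, but iterating row by row with `iterrows` is exactly the kind of loop the vectorized answers below avoid.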
Upvotes: 0
Views: 37
Reputation: 353149
I'd melt and then use pivot_table:
In [80]: d2 = pd.melt(df, id_vars="AID")
In [81]: d2.pivot_table(index="AID", columns="value", values="variable",
aggfunc="count", fill_value=0)
Out[81]:
value A B C X Y
AID
1 2 1 0 1 0
3 1 0 0 0 0
4 0 0 1 1 0
5 0 0 2 1 1
6 1 1 0 0 0
This works because melt "flattens" the dataframe into something where we can more easily access the values together, and pivot_table is for exactly the type of aggregation you have in mind:
In [90]: pd.melt(df, "AID")
Out[90]:
AID variable value
0 1 JID A
1 1 JID A
2 1 JID B
3 1 JID NaN
4 3 JID A
[... skipped]
17 4 CID X
18 5 CID Y
19 5 CID X
20 6 CID NaN
21 6 CID NaN
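Putting the two steps together as a self-contained script (the DataFrame is reconstructed from the sample data in the question):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "AID": [1, 1, 1, 1, 3, 4, 4, 5, 5, 6, 6],
    "JID": ["A", "A", "B", np.nan, "A", np.nan, "C", "C", "C", "A", "B"],
    "CID": [np.nan, np.nan, np.nan, "X", np.nan, np.nan, "X", "Y", "X",
            np.nan, np.nan],
})

# Step 1: melt JID/CID into long form -> columns AID, variable, value.
d2 = pd.melt(df, id_vars="AID")

# Step 2: count occurrences of each value per AID. NaN values are
# dropped automatically when "value" is used as the pivot columns.
result = d2.pivot_table(index="AID", columns="value",
                        values="variable", aggfunc="count", fill_value=0)
```

Note that rows where both JID and CID are NaN (like AID 4's first row) simply contribute nothing, which is what you want here.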
Upvotes: 3
Reputation: 862751
You can first create a Series with stack, then use groupby with value_counts, and finally reshape with unstack:
df = df.set_index('AID').stack().groupby(level=0).value_counts().unstack(1, fill_value=0)
print(df)
A B C X Y
AID
1 2 1 0 1 0
3 1 0 0 0 0
4 0 0 1 1 0
5 0 0 2 1 1
6 1 1 0 0 0
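As a runnable sketch, with the chain broken onto separate lines for readability and the sample DataFrame reconstructed from the question (the result is bound to `out` rather than overwriting `df`, which is a stylistic choice):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "AID": [1, 1, 1, 1, 3, 4, 4, 5, 5, 6, 6],
    "JID": ["A", "A", "B", np.nan, "A", np.nan, "C", "C", "C", "A", "B"],
    "CID": [np.nan, np.nan, np.nan, "X", np.nan, np.nan, "X", "Y", "X",
            np.nan, np.nan],
})

out = (df.set_index("AID")
         .stack()                  # drops NaN; Series indexed by (AID, column)
         .groupby(level=0)         # group by AID
         .value_counts()           # count each value within its AID group
         .unstack(1, fill_value=0))  # spread the values out into columns
```

The key detail is that stack drops NaN entries by default, so the missing JID/CID cells fall out of the counts without any explicit filtering.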
Upvotes: 1