Reputation: 21562
I have this kind of dataframe df:
User,C,G
111,ar,1
112,es,1
112,es,1
112,es,2
113,es,2
113,es,3
113,es,3
114,es,4
What I would like to return as output is:
G,nU,ar,es
1,2,1,1
2,2,0,2
3,1,0,1
4,1,0,1
Basically, for each G
, I'm counting the number of different User
inside it on the nU
column and the occurrences of the strings in C
. Each User has a unique C
value.
For instance, in the G
number 1 I have two Users (111 and 112), with one occurrence in 'ar' and one in 'es' (no matter if there are two 112 occurrences, I just need the (112,'es') single couple). Summing up the 'ar' and 'es' columns should return the nU
column. So far I tried this:
d = df.reset_index().groupby('G')['User'].nunique()
which correctly returns the count of Users but no information about the C
column.
Sorry for the confusion this might cause.
Upvotes: 2
Views: 9535
Reputation: 879919
Given df
,
result = df.groupby(['G', 'User'])['C'].value_counts()
yields
G User
1 111 ar 1
112 es 2
2 112 es 1
113 es 1
3 113 es 2
4 114 es 1
dtype: int64
This counts each occurrence of ar
and es
. We really only want to count unique occurrences, so let's set each value in the Series to 1:
result[:] = 1
so that result
looks like
G User
1 111 ar 1
112 es 1
2 112 es 1
113 es 1
3 113 es 1
4 114 es 1
dtype: int64
Now if we group by the first and last index levels (the G
values and the C
values), and sum each group,
result = result.groupby(level=['G',-1]).sum()
we get
G
1 ar 1
es 1
2 es 2
3 es 1
4 es 1
dtype: int64
Now we can unstack the last index level:
result = result.unstack()
to obtain
ar es
G
1 1 1
2 NaN 2
3 NaN 1
4 NaN 1
Fill the NaNs with zeros:
result = result.fillna(0)
Define the nU
column and the sum of the rows:
result['nU'] = result.sum(axis=1)
and reorder the columns:
result = result[['nU', 'ar', 'es']]
Putting it all together:
import pandas as pd
df = pd.read_csv('data')
result = df.groupby(['G', 'User'])['C'].value_counts()
result[:] = 1
result = result.groupby(level=['G',-1]).sum()
result = result.unstack()
result = result.fillna(0)
result['nU'] = result.sum(axis=1)
result = result[['nU', 'ar', 'es']]
yields
nU ar es
G
1 2 1 1
2 2 0 2
3 1 0 1
4 1 0 1
Upvotes: 4