Reputation: 137
I have a dataframe with two columns ID and labels. Labels can only be 0 or 1.
The code below generates such a dataframe
data = [[10105, 1], [10105, 1], [10105, 0], [20205, 0], [20205, 0], [20205, 1], [20205, 1]]
test=pd.DataFrame(data,columns=["ID","label"])
test
ID label
0 10105 1
1 10105 1
2 10105 0
3 20205 0
4 20205 0
5 20205 1
6 20205 1
I would like to get some statistics about the labels once data is grouped by ID.
The
test.groupby('ID')
will group the entries by ID but then I want to see how many entries with the ID 10105 have a label of 1 and how many have a label of a 0. Also I would like to calculate the percentage of 0s. That would be then the ideal output
ID 10105, label1: 2, label0: 1, Percantage (label0/(label1+label0)): 1/3
ID 20205, label1: 2, label0: 2, Percantage (label0/(label1+label0)): 2/4
I think python has a way to aggregate results but at the same time I need a way to make calculations between the labels of a specific ID.
Can you please help me?
I would like to thank you in advance for your reply.
Regards Alex
Upvotes: 3
Views: 5283
Reputation: 862761
You can use GroupBy.agg
with tuples for new column name with aggregate function - for label0
and Percantage
compare values by 0
and count by sum
, mean
for percentage. It working, because True
s are processing like 1
.
df = test.groupby('ID')['label'].agg([('label1','sum'),
('label0',lambda x: x.eq(0).sum()),
('Percantage',lambda x: x.eq(0).mean())])
print (df)
label1 label0 Percantage
ID
10105 2 1 0.333333
20205 2 2 0.500000
If use pandas 0.25+:
def label0(x):
return x.eq(0).sum()
def Percantage(x):
return x.eq(0).mean()
df = test.groupby('ID').agg(label1=pd.NamedAgg(column='label', aggfunc='sum'),
label0=pd.NamedAgg(column='label', aggfunc=label0),
Percantage=pd.NamedAgg(column='label', aggfunc=Percantage))
Last if need column from index:
df = df.reset_index()
print (df)
ID label1 label0 Percantage
0 10105 2 1 0.333333
1 20205 2 2 0.500000
Upvotes: 4
Reputation: 2887
Use:
>>> test['label'].groupby(test['ID']).value_counts(normalize=True)
ID label
10105 1 0.666667
0 0.333333
20205 0 0.500000
1 0.500000
Name: label, dtype: float64
@jezrael I did not know you can group a Series by another column. Neat
Upvotes: 1