amit singh
amit singh

Reputation: 177

Pandas groupy strings with conditions

data = {'Item_No':['001', '001', '002','002','002','003','003'], 
        'kitting':['no','yes', 'no', 'yes', 'no','no','no']} 
df = pd.DataFrame(data)

I would like to group by 'Item_No'. Change Kitting column to 'yes' for any Item_no that has at least 1 'yes' otherwise 'no'. And a new Column to get % of total. like below...

Item_No     Kitting          %kitting 
    001      yes               50%  
    002      yes               33%
    003       no                0%

Upvotes: 1

Views: 78

Answers (2)

Ch3steR
Ch3steR

Reputation: 20659

We can use GroupBy.value_counts with normalize set to True. Then use Series.reindex to extract where level 1 has 'yes'. If prec is 0 then no yes is present.

out = df.groupby("Item_No")["kitting"].value_counts(normalize=True)
idx = pd.MultiIndex.from_product((df["Item_No"].unique(), ["yes", "no"]))
out = out.reindex(idx, fill_value=0).loc[:, "yes"].to_frame("prec")
out = (
    out.assign(kitting=np.where(out["prec"].eq(0), "no", "yes"))
    .rename_axis("Item_No")
    .reset_index()
)

print(out)

  Item_No      prec kitting
0     001  0.500000     yes
1     002  0.333333     yes
2     003  0.000000      no

Another option (thanks to jezrael)

out = (
    df.groupby("Item_No")["kitting"]
    .value_counts(normalize=True)
    .unstack(fill_value=0)["yes"]
    .to_frame("prec")
)
out.assign(kitting=np.where(out["prec"].eq(0), "no", "yes")).reset_index()
   

 Item_No      prec kitting
0     001  0.500000     yes
1     002  0.333333     yes
2     003  0.000000      no

Upvotes: 1

jezrael
jezrael

Reputation: 863176

First idea is replace non yes values to NaN in helper column in DataFrame.assign and use GroupBy.agg for first non missing value and for percentage is used mean in lambda function with multiple 100:

m = df['kitting'].eq('yes')
df = (df.assign(m = m, a = df['kitting'].where(m))
        .groupby('Item_No', as_index=False)
        .agg(Kitting = ('a','first'),
             perc = ('m', lambda x: x.mean() * 100))
        .fillna('no'))
print (df)
  Item_No Kitting       perc
0     001     yes  50.000000
1     002     yes  33.333333
2     003      no   0.000000
    

Another idea is use ordered categoricals and aggregate min first:

s = pd.Categorical(df['kitting'], ordered=True, categories=['yes','no'])
df = (df.assign(m = s == 'yes', a = s)
        .groupby('Item_No', as_index=False)
        .agg(Kitting = ('a','min'),
             perc = ('m', lambda x: x.mean() * 100)))
print (df)
  Item_No Kitting       perc
0     001     yes  50.000000
1     002     yes  33.333333
2     003      no   0.000000
    

Upvotes: 3

Related Questions