Lynn
Lynn

Reputation: 4398

Multiple groupings and groupby aggregations using Python Pandas

I have a dataset that looks very similar to the data below.

  1. I would like to create two groups using values in the sku column.

    group1 - new &

    group2 - old

  2. Within the group1 - new, I would like to then groupby rack and take the average of each grouped rack and sum everything in the group1 - old

data:

     rack   sku    used    free     total   date
     a      old     1       4       5       11/1/2020
     b      old     1       4       5       11/1/2020
     c      old     1       4       5       11/1/2020
     d      new     2       1       3       11/1/2020
     e      new     2       1       3       11/1/2020
     f      old     1       1       2       11/1/2020
     g      old     1       1       2       11/1/2020
     e      new     2       2       4       11/1/2020
     d      new     2       2       4       11/1/2020

Desired Result

        used    free    total   date

        2       1.5     3.5     11/1/2020
        2       1.5     3.5     11/1/2020
        5       14      19      11/1/2020

We get this result above because :

group1- new has two groups: d and e The average for d used is 2, the average for d free is 1.5 and the average for d total is 3.5

The average for e used is 2, the average for e free is 1.5 and the average for e total is 3.5

group2- old is simply taking the sum which the used is 5, free is 14 and total is 19.

This is what the data looks like after the grouping:

id  group   used    free    total   date
d   new     2       1       3       11/1/2020
d   new     2       2       4       11/1/2020
e   new     2       1       3       11/1/2020
e   new     2       2       4       11/1/2020
a   old     1       4       5       11/1/2020
b   old     1       4       5       11/1/2020
c   old     1       4       5       11/1/2020
f   old     1       1       2       11/1/2020
g   old     1       1       2       11/1/2020

This is what I am doing:

    new = df[df.sku.str.contains('|'.df(['new']), na = False)]

    old = df[df.sku.str.contains('|'.df(['old']), na = False)] 

set = jdf.groupby('rack').agg({'used': 'mean', 'free': 'mean', 
                                           'total': 'mean'}).sum().to_frame().T

I am just not sure how to put this together to create the new dataframe. Any suggestion is appreciated.

Upvotes: 1

Views: 88

Answers (3)

An approach to take the mean by rack to all groups except the last one (which will be the sum):

# Define last group for suming
last_grp = 'old'
# Calculate the mean by rack to all groups but last one
out = df.query('sku!=@last_grp').groupby(['sku','rack']).mean()
# Add sum from last group to df
out.loc[(last_grp, ''),:] = df.query('sku==@last_grp').select_dtypes(np.number).sum()
# Add date back to df (if more than one date, it'll be concatenated with comma)
out['date'] = ','.join(df.date.unique())

Output:

          used  free  total       date
sku rack                              
new d      2.0   1.5    3.5  11/1/2020
    e      2.0   1.5    3.5  11/1/2020
old        5.0  14.0   19.0  11/1/2020

Note: If you don't want to summarize all groups, you gonna need to replace df.query('sku!=@last_grp') by df[df.sku.isin(target_grps)] where target_grps stands by your target groups (e.g. 'new', 'foo', 'foo2').

Upvotes: 1

jezrael
jezrael

Reputation: 862771

You can filter by conditions with Series.isin for match multiple sku and for old all not match values by ~ for invert mask:

mask = df.sku.isin(['new','foo','foo1','foo3'])
new = df[mask]

old = df[~mask] 

Then aggregate mean and for all another values use sum only for numeric columns:

df = (new.groupby('rack').mean()
         .append(old.select_dtypes(np.number).sum().to_frame('old').T)
         .rename_axis('col')
         .reset_index())

print (df)
   col  backup  free  total
0    d     2.0   1.5    3.5
1    e     2.0   1.5    3.5
2  old     5.0  14.0   19.0

If possible grouping by rack and date solution is a bit changed - for old values is used first date value:

mask = df.sku.isin(['new','foo','foo1','foo3'])
new = df[mask]

old = df[~mask] 

df = (new.groupby(['rack', 'date']).mean()
         .append(old.select_dtypes(np.number).sum().to_frame(('old', old['date'].iat[0])).T)
         .reset_index())

print (df)
  rack       date  backup  free  total
0    d  11/1/2020     2.0   1.5    3.5
1    e  11/1/2020     2.0   1.5    3.5
2  old  11/1/2020     5.0  14.0   19.0

Upvotes: 2

Mykola Zotko
Mykola Zotko

Reputation: 17834

You can first groupby by 'sku'. Then you can get a target group with get_group, groupby again and calculate the summary. For example, mean value for 'new':

g = df.groupby('sku')
g.get_group('new').groupby('rack').mean()

Output:

      used  free  total
rack                   
d      2.0   1.5    3.5
e      2.0   1.5    3.5

Upvotes: 1

Related Questions