Reputation: 4398
I have a dataset that looks very similar to the data below.
I would like to create two groups using values in the sku column.
group1 - new &
group2 - old
Within the group1 - new, I would like to then groupby rack and take the average of each grouped rack and sum everything in the group1 - old
data:
rack sku used free total date
a old 1 4 5 11/1/2020
b old 1 4 5 11/1/2020
c old 1 4 5 11/1/2020
d new 2 1 3 11/1/2020
e new 2 1 3 11/1/2020
f old 1 1 2 11/1/2020
g old 1 1 2 11/1/2020
e new 2 2 4 11/1/2020
d new 2 2 4 11/1/2020
Desired Result
used free total date
2 1.5 3.5 11/1/2020
2 1.5 3.5 11/1/2020
5 14 19 11/1/2020
We get this result above because :
group1- new has two groups: d and e The average for d used is 2, the average for d free is 1.5 and the average for d total is 3.5
The average for e used is 2, the average for e free is 1.5 and the average for e total is 3.5
group2- old is simply taking the sum which the used is 5, free is 14 and total is 19.
This is what the data looks like after the grouping:
id group used free total date
d new 2 1 3 11/1/2020
d new 2 2 4 11/1/2020
e new 2 1 3 11/1/2020
e new 2 2 4 11/1/2020
a old 1 4 5 11/1/2020
b old 1 4 5 11/1/2020
c old 1 4 5 11/1/2020
f old 1 1 2 11/1/2020
g old 1 1 2 11/1/2020
This is what I am doing:
new = df[df.sku.str.contains('|'.df(['new']), na = False)]
old = df[df.sku.str.contains('|'.df(['old']), na = False)]
set = jdf.groupby('rack').agg({'used': 'mean', 'free': 'mean',
'total': 'mean'}).sum().to_frame().T
I am just not sure how to put this together to create the new dataframe. Any suggestion is appreciated.
Upvotes: 1
Views: 88
Reputation: 4929
An approach to take the mean by rack to all groups except the last one (which will be the sum):
# Define last group for suming
last_grp = 'old'
# Calculate the mean by rack to all groups but last one
out = df.query('sku!=@last_grp').groupby(['sku','rack']).mean()
# Add sum from last group to df
out.loc[(last_grp, ''),:] = df.query('sku==@last_grp').select_dtypes(np.number).sum()
# Add date back to df (if more than one date, it'll be concatenated with comma)
out['date'] = ','.join(df.date.unique())
Output:
used free total date
sku rack
new d 2.0 1.5 3.5 11/1/2020
e 2.0 1.5 3.5 11/1/2020
old 5.0 14.0 19.0 11/1/2020
Note: If you don't want to summarize all groups, you gonna need to replace df.query('sku!=@last_grp')
by df[df.sku.isin(target_grps)]
where target_grps
stands by your target groups (e.g. 'new', 'foo', 'foo2').
Upvotes: 1
Reputation: 862771
You can filter by conditions with Series.isin
for match multiple sku
and for old all not match values by ~
for invert mask:
mask = df.sku.isin(['new','foo','foo1','foo3'])
new = df[mask]
old = df[~mask]
Then aggregate mean
and for all another values use sum
only for numeric columns:
df = (new.groupby('rack').mean()
.append(old.select_dtypes(np.number).sum().to_frame('old').T)
.rename_axis('col')
.reset_index())
print (df)
col backup free total
0 d 2.0 1.5 3.5
1 e 2.0 1.5 3.5
2 old 5.0 14.0 19.0
If possible grouping by rack
and date
solution is a bit changed - for old values is used first date
value:
mask = df.sku.isin(['new','foo','foo1','foo3'])
new = df[mask]
old = df[~mask]
df = (new.groupby(['rack', 'date']).mean()
.append(old.select_dtypes(np.number).sum().to_frame(('old', old['date'].iat[0])).T)
.reset_index())
print (df)
rack date backup free total
0 d 11/1/2020 2.0 1.5 3.5
1 e 11/1/2020 2.0 1.5 3.5
2 old 11/1/2020 5.0 14.0 19.0
Upvotes: 2
Reputation: 17834
You can first groupby
by 'sku'
. Then you can get a target group with get_group
, groupby
again and calculate the summary. For example, mean value for 'new'
:
g = df.groupby('sku')
g.get_group('new').groupby('rack').mean()
Output:
used free total
rack
d 2.0 1.5 3.5
e 2.0 1.5 3.5
Upvotes: 1