Reputation: 26069
I have a dataframe like this:
import pandas as pd
df = pd.DataFrame({
'stuff_1_var_1': range(5),
'stuff_1_var_2': range(2, 7),
'stuff_2_var_1': range(3, 8),
'stuff_2_var_2': range(5, 10)
})
stuff_1_var_1 stuff_1_var_2 stuff_2_var_1 stuff_2_var_2
0 0 2 3 5
1 1 3 4 6
I would like to groupby
based on the column headers and then add the mean
and median
of each group as new columns. So my expected output looks like this:
stuff_1_var_mean stuff_1_var_median stuff_2_var_mean stuff_2_var_median
0 1 1 4 4
1 2 2 5 5
Brief explanation:
we have two groups stuff_1_var_
and stuff_2_var_
for which would calculate the mean and median per row. So, e.g. for stuff_1_var_
it would be:
# values from stuff_1_var_1 and stuff_1_var_2
(0 + 2) / 2 = 1 and
( 1 + 3) / 2 = 2
The values are then added as a new column stuff_1_var_mean
; analogue for meadian
and stuff_2_var_
.
I got until:
df = df.T
pattern = df.index.str.extract('(^stuff_\d_var_)', expand=False)
dfgb = df.groupby(pattern).agg(['mean', 'median']).T
stuff_1_var_ stuff_2_var_
0 mean 1 4
median 1 4
1 mean 2 5
median 2 5
How can I do the final step(s)?
Upvotes: 0
Views: 266
Reputation: 863791
Your solution should be changed:
df = df.T
pattern = df.index.str.extract('(^stuff_\d_var_)', expand=False)
dfgb = df.groupby(pattern).agg(['mean', 'median']).T.unstack()
dfgb.columns = dfgb.columns.map(lambda x: f'{x[0]}{x[1]}')
print (dfgb)
stuff_1_var_mean stuff_1_var_median stuff_2_var_mean stuff_2_var_median
0 1 1 4 4
1 2 2 5 5
2 3 3 6 6
3 4 4 7 7
4 5 5 8 8
Unfortunately for axis=1
is not implemented agg
, so possible solution is create mean
and median
separately and then concat
:
dfgb = df.groupby(pattern, axis=1).agg(['mean','median'])
NotImplementedError: axis other than 0 is not supported
pattern = df.columns.str.extract('(^stuff_\d_var_)', expand=False)
g = df.groupby(pattern, axis=1)
dfgb = pd.concat([g.mean().add_suffix('mean'),
g.median().add_suffix('median')], axis=1)
dfgb = dfgb.iloc[:, [0,2,1,3]]
print (dfgb)
stuff_1_var_mean stuff_1_var_median stuff_2_var_mean stuff_2_var_median
0 1 1 4 4
1 2 2 5 5
2 3 3 6 6
3 4 4 7 7
4 5 5 8 8
Upvotes: 1
Reputation: 21759
Here's a way you can do:
col = 'stuff_1_var_'
use_col = [x for x in df.columns if 'stuff_1' in x]
df[f'{col}mean'] = df[use_col].mean(1)
df[f'{col}median'] = df[use_col].median(1)
col2 = 'stuff_2_var_'
use_col = [x for x in df.columns if 'stuff_2' in x]
df[f'{col2}mean'] = df[use_col].mean(1)
df[f'{col2}median'] = df[use_col].median(1)
print(df.iloc[:,-4:]) # showing last four new columns
stuff_1_var_mean stuff_1_var_median stuff_2_var_mean stuff_2_var_median
0 1.0 1.0 4.0 4.0
1 2.0 2.0 5.0 5.0
2 3.0 3.0 6.0 6.0
3 4.0 4.0 7.0 7.0
4 5.0 5.0 8.0 8.0
Ofcourse, you can put it in a function to avoid repeating the same code.
Upvotes: 1