Reputation: 75
My dataframe resembles the following:
group_var1 = ['A1','A1', 'A1', 'A1', 'A1', 'A1', 'A1', 'A1', 'A2', 'A2', 'A2', 'A2', 'A2', 'A2', 'A2', 'A2']
group_var2 = ['B1', 'B1', 'B1', 'B1', 'B2', 'B2', 'B2', 'B2', 'B1', 'B1', 'B1', 'B1', 'B2', 'B2', 'B2', 'B2']
group_var3 = ['C1', 'C2', 'C1', 'C2', 'C1', 'C2', 'C1', 'C2', 'C1', 'C2', 'C1', 'C2', 'C1', 'C2', 'C1', 'C2']
value = np.arange(len(group_var1))
ex_df = pd.DataFrame({
'group_var1' : group_var1,
'group_var2' : group_var2,
'group_var3' : group_var3,
'value' : value
})
which should produce the following when called:
group_va1 group_va2 group_va3 value
0 A1 B1 C1 0
1 A1 B1 C2 1
2 A1 B1 C1 2
3 A1 B1 C2 3
4 A1 B2 C1 4
5 A1 B2 C2 5
6 A1 B2 C1 6
7 A1 B2 C2 7
8 A2 B1 C1 8
9 A2 B1 C2 9
10 A2 B1 C1 10
11 A2 B1 C2 11
12 A2 B2 C1 12
13 A2 B2 C2 13
14 A2 B2 C1 14
15 A2 B2 C2 15
My goal is to group the DataFrame by columns group_var1
, group_var2
, and group_var3
, and then calculate the mean value
within each group, and add a new row mean_ex
to this existing DataFrame with these results. In R this could be accomplished by ex_df %>% group_by(c(group_var1, group_var2, group_var3)) %>% mutate(mean_ex = mean(value))
which automatically handles assigning the new value to the respective rows.
I have found ways to summarize the DataFrame by aggregating all of the values within each group into only the mean value (ex_df.groupby(['group_var1', 'group_var2', 'group_var3']).mean()
), but I want the mean value to be a new column in my existing DataFrame, not a df of smaller dimensions. My desired output can be found below, and was achieved by using pd.merge() on my original df and my previously described undesired aggregated table:
group_var1 group_var2 group_var3 value mean_ex
0 A1 B1 C1 0 1
1 A1 B1 C2 1 2
2 A1 B1 C1 2 1
3 A1 B1 C2 3 2
4 A1 B2 C1 4 5
5 A1 B2 C2 5 6
6 A1 B2 C1 6 5
7 A1 B2 C2 7 6
8 A2 B1 C1 8 9
9 A2 B1 C2 9 10
10 A2 B1 C1 10 9
11 A2 B1 C2 11 10
12 A2 B2 C1 12 13
13 A2 B2 C2 13 14
14 A2 B2 C1 14 13
15 A2 B2 C2 15 14
My question is, is there any way to achieve this desired output without having to create a separate dataframe that then gets merged back onto the original? Thanks.
Upvotes: 3
Views: 1701
Reputation: 3835
You can implement it in the same way in python as you did in R, using datar
:
>>> from datar.all import c, f, tibble, mutate, mean, group_by
>>>
>>> group_var1 = ['A1','A1', 'A1', 'A1', 'A1', 'A1', 'A1', 'A1', 'A2', 'A2', 'A2', 'A2', 'A2', 'A2', 'A2'
, 'A2']
>>> group_var2 = ['B1', 'B1', 'B1', 'B1', 'B2', 'B2', 'B2', 'B2', 'B1', 'B1', 'B1', 'B1', 'B2', 'B2', 'B2
', 'B2']
>>> group_var3 = ['C1', 'C2', 'C1', 'C2', 'C1', 'C2', 'C1', 'C2', 'C1', 'C2', 'C1', 'C2', 'C1', 'C2', 'C1
', 'C2']
>>>
>>> df = tibble(
... group_var1,
... group_var2,
... group_var3,
... value=range(len(group_var1))
... )
>>>
>>> df >> group_by(
... f.group_var1, f.group_var2, f.group_var3
... ) >> mutate(
... mean_ex = mean(f.value)
... )
group_var1 group_var2 group_var3 value mean_ex
<object> <object> <object> <int64> <float64>
0 A1 B1 C1 0 1.0
1 A1 B1 C2 1 2.0
2 A1 B1 C1 2 1.0
3 A1 B1 C2 3 2.0
4 A1 B2 C1 4 5.0
5 A1 B2 C2 5 6.0
6 A1 B2 C1 6 5.0
7 A1 B2 C2 7 6.0
8 A2 B1 C1 8 9.0
9 A2 B1 C2 9 10.0
10 A2 B1 C1 10 9.0
11 A2 B1 C2 11 10.0
12 A2 B2 C1 12 13.0
13 A2 B2 C2 13 14.0
14 A2 B2 C1 14 13.0
15 A2 B2 C2 15 14.0
[Groups: group_var1, group_var2, group_var3 (n=8)]
Disclaimer: I am the author of the datar
package.
Upvotes: 1
Reputation: 21274
Use transform
and assign
:
ex_df.assign(
mean_val =
ex_df
.groupby(["group_var1", "group_var2", "group_var3"])
.value
.transform('mean')
)
group_var1 group_var2 group_var3 value mean_val
0 A1 B1 C1 0 1
1 A1 B1 C2 1 2
2 A1 B1 C1 2 1
3 A1 B1 C2 3 2
4 A1 B2 C1 4 5
5 A1 B2 C2 5 6
6 A1 B2 C1 6 5
7 A1 B2 C2 7 6
8 A2 B1 C1 8 9
9 A2 B1 C2 9 10
10 A2 B1 C1 10 9
11 A2 B1 C2 11 10
12 A2 B2 C1 12 13
13 A2 B2 C2 13 14
14 A2 B2 C1 14 13
15 A2 B2 C2 15 14
Explanation
Pandas assign
is the rough equivalent of dplyr::mutate
, and transform
broadcasts the grouping operation across all the initial rows of an input, rather than simply calling an aggregation function after groupby
.
Something like df.groupby('a').x.mean()
will result in a single value per grouped index, set, which is the analog to dplyr::summarise
.
Upvotes: 1