Reputation: 61
In Python Pandas, I want to add columns by applying multiple aggregate functions to multiple columns, like R dplyr's mutate_each. For example, can Pandas reproduce the same processing as the following R script?
R dplyr:
iris %>%
  group_by(Species) %>%
  mutate_each(funs(min, max, mean), starts_with("Sepal"))
However, I was able to achieve the equivalent of mutate with Pandas: as shown in the code below, I can apply one aggregate function and add one column.
R dplyr:
iris %>% group_by(Species) %>% mutate(MaxSepalLen = max(Sepal.Length))
Python Pandas:
iris.assign(MaxSepalLen = iris.groupby("Species")["Sepal.Length"].transform('max'))
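For reference, a minimal sketch of how this transform trick might generalize to several columns and several aggregate functions in plain Pandas (assuming iris is a DataFrame with a Species column and the R-style Sepal.* column names used above):
import pandas as pd

# collect every column starting with "Sepal" and apply each function per group
sepal_cols = [c for c in iris.columns if c.startswith("Sepal")]
grouped = iris.groupby("Species")
new_cols = {
    f"{col}_{fn}": grouped[col].transform(fn)
    for col in sepal_cols
    for fn in ("min", "max", "mean")
}
iris_extended = iris.assign(**new_cols)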
Upvotes: 6
Views: 861
Reputation: 3835
mutate_each is superseded by mutate and across.
You can try this in Python:
>>> from datar.all import f, group_by, starts_with, mutate, across, max, min, mean
>>> from datar.datasets import iris
>>>
>>> iris >> \
... group_by(f.Species) >> \
... mutate(across(starts_with("Sepal"), [min, max, mean]))
Sepal_Length Sepal_Width Petal_Length Petal_Width Species Sepal_Length_1 Sepal_Length_2 Sepal_Length_3 Sepal_Width_1 Sepal_Width_2 Sepal_Width_3
<float64> <float64> <float64> <float64> <object> <float64> <float64> <float64> <float64> <float64> <float64>
0 5.1 3.5 1.4 0.2 setosa 4.3 5.8 5.006 2.3 4.4 3.428
1 4.9 3.0 1.4 0.2 setosa 4.3 5.8 5.006 2.3 4.4 3.428
2 4.7 3.2 1.3 0.2 setosa 4.3 5.8 5.006 2.3 4.4 3.428
3 4.6 3.1 1.5 0.2 setosa 4.3 5.8 5.006 2.3 4.4 3.428
4 5.0 3.6 1.4 0.2 setosa 4.3 5.8 5.006 2.3 4.4 3.428
.. ... ... ... ... ... ... ... ... ... ... ...
145 6.7 3.0 5.2 2.3 virginica 4.9 7.9 6.588 2.2 3.8 2.974
146 6.3 2.5 5.0 1.9 virginica 4.9 7.9 6.588 2.2 3.8 2.974
147 6.5 3.0 5.2 2.0 virginica 4.9 7.9 6.588 2.2 3.8 2.974
148 6.2 3.4 5.4 2.3 virginica 4.9 7.9 6.588 2.2 3.8 2.974
149 5.9 3.0 5.1 1.8 virginica 4.9 7.9 6.588 2.2 3.8 2.974
[Groups: Species (n=3)]
[150 rows x 11 columns]
I am the author of the datar package.
Upvotes: 0
Reputation: 11434
With Pandas, this can be accomplished in a somewhat lengthier way.
First, let's prepare the data:
import pandas as pd
import numpy as np
from sklearn.datasets import load_iris
iris_data = load_iris()
iris = pd.DataFrame(iris_data.data, columns = [c[0:3] + c[6] for c in iris_data.feature_names])
iris['Species'] = iris_data.target_names[iris_data.target]
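The slicing trick c[0:3] + c[6] just shortens the sklearn feature names, so for example 'sepal length (cm)' becomes sepl. A quick check of the resulting frame (expected output shown as a comment):
print(iris.columns.tolist())
# ['sepl', 'sepw', 'petl', 'petw', 'Species']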
Now we can imitate the mutate_each pipeline:
# calculate the aggregates
pivot = iris.groupby("Species")[
    iris.columns[iris.columns.str.startswith('sepal')]
].aggregate(['min', 'max', np.mean])
# name the aggregates
pivot.columns = pivot.columns.get_level_values(0) + pivot.columns.get_level_values(1)
# merge aggregates with the original dataframe
new_iris = iris.merge(pivot, left_on='Species', right_index=True)
The pivot object is really just a small pivot table:
seplmin seplmax seplmean sepwmin sepwmax sepwmean
Species
setosa 4.3 5.8 5.006 2.3 4.4 3.418
versicolor 4.9 7.0 5.936 2.0 3.4 2.770
virginica 4.9 7.9 6.588 2.2 3.8 2.974
And the new_iris is a 150x11 table with all columns from iris and pivot combined, identical to what dplyr outputs.
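As a quick sanity check (a sketch using the names defined above), you can confirm the shape and that every species carries a single set of aggregate values:
# 150 rows, 4 original feature columns + Species + 6 aggregate columns
print(new_iris.shape)            # (150, 11)

# each species should map to exactly one distinct value per aggregate column
print(new_iris.groupby('Species')[['seplmin', 'seplmax', 'seplmean']].nunique())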
Upvotes: 1