Reputation: 1151
I'm trying to do some aggregations on a pandas data frame. Here is a sample code:
import pandas as pd
df = pd.DataFrame({"User": ["user1", "user2", "user2", "user3", "user2", "user1"],
"Amount": [10.0, 5.0, 8.0, 10.5, 7.5, 8.0]})
df.groupby(["User"]).agg({"Amount": {"Sum": "sum", "Count": "count"}})
Out[1]:
Amount
Sum Count
User
user1 18.0 2
user2 20.5 3
user3 10.5 1
Which generates the following warning:
FutureWarning: using a dict with renaming is deprecated and will be removed in a future version return super(DataFrameGroupBy, self).aggregate(arg, *args, **kwargs)
How can I avoid this?
Upvotes: 65
Views: 68159
Reputation: 7524
The best solution for me would be from Jacob Stevenson. However there is still something to improve:
Amount
The sum The count
User
user1 18.0 2
user2 20.5 3
user3 10.5 1
The headers could be one-liner if Amount
is removed.
This variant does it using pd.NamedAgg:
# Using Titanic set
filepath = r'data\titanic.csv'
df = pd.read_csv(filepath, index_col='PassengerId')
# Using pd.NamedAgg
g = (df[['Sex', 'Pclass', 'Fare', 'Age']]
.groupby(['Sex', 'Pclass'])
.agg(**{'#': pd.NamedAgg('Fare', 'size'),
'Fare mean': pd.NamedAgg('Fare', 'mean'),
'Age mean': pd.NamedAgg('Age', 'mean')})
.astype(int)
)
# Fare mean Age mean
Sex Pclass
female 1 94 106 34
2 76 21 28
3 144 16 21
male 1 122 67 41
2 108 19 30
3 347 12 26
Upvotes: 0
Reputation: 121
This worked for me, Pandas version 1.2.4
For each column we add a list which consists of tuples:
df.groupby('column to group by').agg(
{'column name': [('new column name', 'function to apply')]})
Example
# Create DataFrame
df=pd.DataFrame(data={'id':[1,1,2,3],'col1': [1,2,1,5], 'col2':[5,8,6,4]})
# Apply grouping
grouped = df.groupby('id').agg({
'col1': [('name1', 'sum')],
'col2': [('name2_mean', 'sum'), ('name2_custom_std', lambda x: np.std(x))]})
# Drop multi-index for columns and reset index
grouped.columns = grouped.columns.droplevel()
grouped.reset_index(inplace=True)
Result:
id | name1 | name2_mean | name2_custom_std | |
---|---|---|---|---|
0 | 1 | 3 | 13 | 1.5 |
1 | 2 | 1 | 6 | 0.0 |
2 | 3 | 5 | 4 | 0.0 |
Upvotes: 12
Reputation: 153510
import pandas as pd
print(pd.__version__)
#0.25.0
df = pd.DataFrame({"User": ["user1", "user2", "user2", "user3", "user2", "user1"],
"Amount": [10.0, 5.0, 8.0, 10.5, 7.5, 8.0]})
df.groupby("User")['Amount'].agg(Sum='sum', Count='count')
Output:
Sum Count
User
user1 18.0 2
user2 20.5 3
user3 10.5 1
Upvotes: 21
Reputation: 151
Replace the inner dictionaries with a list of correctly named functions.
To rename the function I'm using this utility function:
def aliased_aggr(aggr, name):
if isinstance(aggr,str):
def f(data):
return data.agg(aggr)
else:
def f(data):
return aggr(data)
f.__name__ = name
return f
The group-by statement then becomes:
df.groupby(["User"]).agg({"Amount": [
aliased_aggr("sum","Sum"),
aliased_aggr("count","Count")
]
If you have bigger, reusable aggregation specs, you can convert them with
def convert_aggr_spec(aggr_spec):
return {
col : [
aliased_aggr(aggr,alias) for alias, aggr in aggr_map.items()
]
for col, aggr_map in aggr_spec.items()
}
So you can say
df.groupby(["User"]).agg(convert_aggr_spec({"Amount": {"Sum": "sum", "Count": "count"}}))
See also https://github.com/pandas-dev/pandas/issues/18366#issuecomment-476597674
Upvotes: 0
Reputation: 1059
This is what I did:
Create a fake dataset:
import pandas as pd
df = pd.DataFrame({"User": ["user1", "user2", "user2", "user3", "user2", "user1", "user3"],
"Amount": [10.0, 5.0, 8.0, 10.5, 7.5, 8.0, 9],
'Score': [9, 1, 8, 7, 7, 6, 9]})
df
O/P:
Amount Score User
0 10.0 9 user1
1 5.0 1 user2
2 8.0 8 user2
3 10.5 7 user3
4 7.5 7 user2
5 8.0 6 user1
6 9.0 9 user3
I first made the User the index, and then a groupby:
ans = df.set_index('User').groupby(level=0)['Amount'].agg([('Sum','sum'),('Count','count')])
ans
Solution:
Sum Count
User
user1 18.0 2
user2 20.5 3
user3 19.5 2
Upvotes: 3
Reputation: 3756
If you replace the internal dictionary with a list of tuples it gets rid of the warning message
import pandas as pd
df = pd.DataFrame({"User": ["user1", "user2", "user2", "user3", "user2", "user1"],
"Amount": [10.0, 5.0, 8.0, 10.5, 7.5, 8.0]})
df.groupby(["User"]).agg({"Amount": [("Sum", "sum"), ("Count", "count")]})
Upvotes: 18
Reputation: 62017
apply
and return a Series to rename columnsUse the groupby apply
method to perform an aggregation that
To do this:
apply
Create fake data
df = pd.DataFrame({"User": ["user1", "user2", "user2", "user3", "user2", "user1", "user3"],
"Amount": [10.0, 5.0, 8.0, 10.5, 7.5, 8.0, 9],
'Score': [9, 1, 8, 7, 7, 6, 9]})
create custom function that returns a Series
The variable x
inside of my_agg
is a DataFrame
def my_agg(x):
names = {
'Amount mean': x['Amount'].mean(),
'Amount std': x['Amount'].std(),
'Amount range': x['Amount'].max() - x['Amount'].min(),
'Score Max': x['Score'].max(),
'Score Sum': x['Score'].sum(),
'Amount Score Sum': (x['Amount'] * x['Score']).sum()}
return pd.Series(names, index=['Amount range', 'Amount std', 'Amount mean',
'Score Sum', 'Score Max', 'Amount Score Sum'])
Pass this custom function to the groupby apply
method
df.groupby('User').apply(my_agg)
The big downside is that this function will be much slower than agg
for the cythonized aggregations
agg
methodUsing a dictionary of dictionaries was removed because of its complexity and somewhat ambiguous nature. There is an ongoing discussion on how to improve this functionality in the future on github Here, you can directly access the aggregating column after the groupby call. Simply pass a list of all the aggregating functions you wish to apply.
df.groupby('User')['Amount'].agg(['sum', 'count'])
Output
sum count
User
user1 18.0 2
user2 20.5 3
user3 10.5 1
It is still possible to use a dictionary to explicitly denote different aggregations for different columns, like here if there was another numeric column named Other
.
df = pd.DataFrame({"User": ["user1", "user2", "user2", "user3", "user2", "user1"],
"Amount": [10.0, 5.0, 8.0, 10.5, 7.5, 8.0],
'Other': [1,2,3,4,5,6]})
df.groupby('User').agg({'Amount' : ['sum', 'count'], 'Other':['max', 'std']})
Output
Amount Other
sum count max std
User
user1 18.0 2 6 3.535534
user2 20.5 3 5 1.527525
user3 10.5 1 4 NaN
Upvotes: 94