Reputation: 11
I'm practicing stats and pandas and I want to create a third column called 'Mean' which works out the means of each grouped location so that each separate row can then perform calculations with it. Here's what I'm starting with:
import pandas as pd
df = pd.DataFrame({'Location': ['Alaska', 'Alaska', 'Amsterdam', 'Amsterdam',
'Arkansas', 'Arkansas'],
'Number': ['300', '500', '250', '600', '400', '150']}
Here's what I want it to look like:
The real_deviations column is to give an idea of what I'd be using it for. I figured it out using a rather long-winded way of groupby and merges but am stuck on trying to find a more streamlined solution.
Any help would be appreciated!
Thanks!
Upvotes: 1
Views: 246
Reputation: 1387
To generate the location-specific mean, you need to do a groupby transform:
df['mean'] = df.groupby('Location')['Number'].transform('mean')
Then, you can proceed to calculate the deviations as:
df['real_deviations'] = df['Number'] - df['mean']
PS: be sure to convert your numbers in "Number" from str to int beforehand. A simple df['Number'] = df['Number'].astype(int)
will do.
Upvotes: 3
Reputation: 31166
The core capability you need is transform
df = pd.DataFrame({'Location': ['Alaska', 'Alaska', 'Amsterdam', 'Amsterdam',
'Arkansas', 'Arkansas'],
'Number': ['300', '500', '250', '600', '400', '150']})
df["Number"] = df["Number"].astype(int)
df.assign(means=df.groupby("Location", as_index=False).transform("mean"),
real_deviation=lambda d: d["Number"] - d["means"])
Location | Number | means | real_deviation | |
---|---|---|---|---|
0 | Alaska | 300 | 400 | -100 |
1 | Alaska | 500 | 400 | 100 |
2 | Amsterdam | 250 | 425 | -175 |
3 | Amsterdam | 600 | 425 | 175 |
4 | Arkansas | 400 | 275 | 125 |
5 | Arkansas | 150 | 275 | -125 |
Upvotes: 0