Reputation: 11
Take a dataframe like this one:
import pandas as pd
info = {'Year': [2010, 2010, 2010, 2010, 2015, 2015, 2015, 2015],
'Country': ['USA', 'Mexico', 'Canada', 'China', 'USA', 'Mexico', 'Canada', 'China'],
'AgeAvg': [40, 44, 45, 49, 45, 46, 50, 52],
'HeightAvg': [68, 65, 67, 68, 69, 70, 64, 67]}
df = pd.DataFrame(data=info)
df
Year Country AgeAvg HeightAvg
0 2010 USA 40 68
1 2010 Mexico 44 65
2 2010 Canada 45 67
3 2010 China 49 68
4 2015 USA 45 69
5 2015 Mexico 46 70
6 2015 Canada 50 64
7 2015 China 52 67
I want to add rows for 2011, 2012, 2013, and 2014. These rows will follow the same Countries, and have a smoothed average of the variables. For example, 2011 USA Age will be 41, 2012 USA age 42, 2013 USA age 43, 2014 USA age 44. This way the age will span from 2010 to 2015. I would also like to do this for all variables (like height in this case), not just age. Is there a way to do this in Python with Pandas?
Upvotes: 1
Views: 125
Reputation: 31226
merge()
to have all rowsinterpolate()
for each country (groupby()
)pd.DataFrame(
{
"Year": range(df["Year"].min(), df["Year"].max()+1),
"Country": [df["Country"].unique() for y in range(df["Year"].min(), df["Year"].max()+1)],
}
).explode("Country").merge(df, on=["Year", "Country"], how="outer").groupby(
"Country"
).apply(
lambda d: d.interpolate()
)
Year | Country | AgeAvg | HeightAvg | |
---|---|---|---|---|
0 | 2010 | USA | 40 | 68 |
1 | 2010 | Mexico | 44 | 65 |
2 | 2010 | Canada | 45 | 67 |
3 | 2010 | China | 49 | 68 |
4 | 2011 | USA | 41 | 68.2 |
5 | 2011 | Mexico | 44.4 | 66 |
6 | 2011 | Canada | 46 | 66.4 |
7 | 2011 | China | 49.6 | 67.8 |
8 | 2012 | USA | 42 | 68.4 |
9 | 2012 | Mexico | 44.8 | 67 |
10 | 2012 | Canada | 47 | 65.8 |
11 | 2012 | China | 50.2 | 67.6 |
12 | 2013 | USA | 43 | 68.6 |
13 | 2013 | Mexico | 45.2 | 68 |
14 | 2013 | Canada | 48 | 65.2 |
15 | 2013 | China | 50.8 | 67.4 |
16 | 2014 | USA | 44 | 68.8 |
17 | 2014 | Mexico | 45.6 | 69 |
18 | 2014 | Canada | 49 | 64.6 |
19 | 2014 | China | 51.4 | 67.2 |
20 | 2015 | USA | 45 | 69 |
21 | 2015 | Mexico | 46 | 70 |
22 | 2015 | Canada | 50 | 64 |
23 | 2015 | China | 52 | 67 |
Upvotes: 0
Reputation: 120509
Use pd.MultiIndex.from_product
to reindex your dataframe and interpolate values:
mi = pd.MultiIndex.from_product([df['Country'].unique(),
range(df.Year.min(), df.Year.max()+1)])
out = df.set_index(['Country', 'Year']).reindex(mi)
out = out.groupby(level=0).apply(lambda x: x.interpolate())
>>> out
AgeAvg HeightAvg
USA 2010 40.0 68.0
2011 41.0 68.2
2012 42.0 68.4
2013 43.0 68.6
2014 44.0 68.8
2015 45.0 69.0
Mexico 2010 44.0 65.0
2011 44.4 66.0
2012 44.8 67.0
2013 45.2 68.0
2014 45.6 69.0
2015 46.0 70.0
Canada 2010 45.0 67.0
2011 46.0 66.4
2012 47.0 65.8
2013 48.0 65.2
2014 49.0 64.6
2015 50.0 64.0
China 2010 49.0 68.0
2011 49.6 67.8
2012 50.2 67.6
2013 50.8 67.4
2014 51.4 67.2
2015 52.0 67.0
You can swap levels if you prefer Year
first.
out = out.swaplevel().sort_index()
Upvotes: 1