Adam Kritz
Adam Kritz

Reputation: 11

How to create rows in a pandas dataframe that are averages of other rows?

Take a dataframe like this one:

import pandas as pd
info = {'Year': [2010, 2010, 2010, 2010, 2015, 2015, 2015, 2015],
        'Country': ['USA', 'Mexico', 'Canada', 'China', 'USA', 'Mexico', 'Canada', 'China'],
        'AgeAvg': [40, 44, 45, 49, 45, 46, 50, 52],
        'HeightAvg': [68, 65, 67, 68, 69, 70, 64, 67]}
df = pd.DataFrame(data=info)
df

   Year Country  AgeAvg  HeightAvg
0  2010     USA      40         68
1  2010  Mexico      44         65
2  2010  Canada      45         67
3  2010   China      49         68
4  2015     USA      45         69
5  2015  Mexico      46         70
6  2015  Canada      50         64
7  2015   China      52         67

I want to add rows for 2011, 2012, 2013, and 2014. These rows will follow the same Countries, and have a smoothed average of the variables. For example, 2011 USA Age will be 41, 2012 USA age 42, 2013 USA age 43, 2014 USA age 44. This way the age will span from 2010 to 2015. I would also like to do this for all variables (like height in this case), not just age. Is there a way to do this in Python with Pandas?

Upvotes: 1

Views: 125

Answers (2)

Rob Raymond
Rob Raymond

Reputation: 31226

  • generate all the combinations for all years
  • merge() to have all rows
  • interpolate() for each country (groupby())
pd.DataFrame(
    {
        "Year": range(df["Year"].min(), df["Year"].max()+1),
        "Country": [df["Country"].unique() for y in range(df["Year"].min(), df["Year"].max()+1)],
    }
).explode("Country").merge(df, on=["Year", "Country"], how="outer").groupby(
    "Country"
).apply(
    lambda d: d.interpolate()
)

Year Country AgeAvg HeightAvg
0 2010 USA 40 68
1 2010 Mexico 44 65
2 2010 Canada 45 67
3 2010 China 49 68
4 2011 USA 41 68.2
5 2011 Mexico 44.4 66
6 2011 Canada 46 66.4
7 2011 China 49.6 67.8
8 2012 USA 42 68.4
9 2012 Mexico 44.8 67
10 2012 Canada 47 65.8
11 2012 China 50.2 67.6
12 2013 USA 43 68.6
13 2013 Mexico 45.2 68
14 2013 Canada 48 65.2
15 2013 China 50.8 67.4
16 2014 USA 44 68.8
17 2014 Mexico 45.6 69
18 2014 Canada 49 64.6
19 2014 China 51.4 67.2
20 2015 USA 45 69
21 2015 Mexico 46 70
22 2015 Canada 50 64
23 2015 China 52 67

Upvotes: 0

Corralien
Corralien

Reputation: 120509

Use pd.MultiIndex.from_product to reindex your dataframe and interpolate values:

mi = pd.MultiIndex.from_product([df['Country'].unique(),
                                 range(df.Year.min(), df.Year.max()+1)])

out = df.set_index(['Country', 'Year']).reindex(mi)
out = out.groupby(level=0).apply(lambda x: x.interpolate())
>>> out
             AgeAvg  HeightAvg
USA    2010    40.0       68.0
       2011    41.0       68.2
       2012    42.0       68.4
       2013    43.0       68.6
       2014    44.0       68.8
       2015    45.0       69.0
Mexico 2010    44.0       65.0
       2011    44.4       66.0
       2012    44.8       67.0
       2013    45.2       68.0
       2014    45.6       69.0
       2015    46.0       70.0
Canada 2010    45.0       67.0
       2011    46.0       66.4
       2012    47.0       65.8
       2013    48.0       65.2
       2014    49.0       64.6
       2015    50.0       64.0
China  2010    49.0       68.0
       2011    49.6       67.8
       2012    50.2       67.6
       2013    50.8       67.4
       2014    51.4       67.2
       2015    52.0       67.0

You can swap levels if you prefer Year first.

out = out.swaplevel().sort_index()

Upvotes: 1

Related Questions