Christian Adib
Christian Adib

Reputation: 121

Obtain most recent value for based on index in a pandas dataframe

On every meeting date, new gdp growth forecasts are made 3 to 4 years into the future. If the gdp growth forecast for a forecast_year is similar to what it was in the last meeting_date, it does not get a new entry in the table.

Is there an easy way to add these missing forecast_year entries for all meeting_dates, quoting the most recent gdp_growth(%) figures available as of the latest meeting_date?

To clarify, here is the input table df_in:

meeting_date forecast_year gdp_growth (%)
2007-11-20 2007 2.45
2007-11-20 2008 2.15
2007-11-20 2009 2.50
2007-11-20 2010 2.55
2008-02-20 2008 1.65
2008-02-20 2009 2.40
2008-02-20 2010 2.75
2008-05-21 2008 0.75
2008-05-21 2010 2.85
2008-07-16 2008 1.30
2008-07-16 2010 2.75
2008-11-19 2008 0.15
2008-11-19 2009 0.45
2008-11-19 2011 3.20

Here is the shape of the output table I need to output, df_out. Rows that need to be added are in bold. What pandas manipulations are needed to transform df_in into df_out?

meeting_date forecast_year GPD Growth (%)
2007-11-20 2007 2.45
2007-11-20 2008 2.15
2007-11-20 2009 2.50
2007-11-20 2010 2.55
2008-02-20 2008 1.65
2008-02-20 2009 2.40
2008-02-20 2010 2.75
2008-05-21 2008 0.75
2008-05-21 2009 2.40
2008-05-21 2010 2.85
2008-07-16 2008 1.30
2008-07-16 2009 2.40
2008-07-16 2010 2.75
2008-11-19 2008 0.15
2008-11-19 2009 0.45
2008-11-19 2010 2.75
2008-11-19 2011 3.20

Upvotes: 1

Views: 189

Answers (2)

Andrej Kesely
Andrej Kesely

Reputation: 195613

Try:

x = (
    df.set_index(["meeting_date", "forecast_year"])
    .unstack(level=1)
    .ffill()
    .stack()
    .reset_index()
)
# remove rows where meeting_date > forecast_year
x = x[~(pd.to_datetime(x["meeting_date"]).dt.year > x["forecast_year"])]
print(x)

Prints:

   meeting_date  forecast_year  gdp_growth (%)
0    2007-11-20           2007            2.45
1    2007-11-20           2008            2.15
2    2007-11-20           2009            2.50
3    2007-11-20           2010            2.55
5    2008-02-20           2008            1.65
6    2008-02-20           2009            2.40
7    2008-02-20           2010            2.75
9    2008-05-21           2008            0.75
10   2008-05-21           2009            2.40
11   2008-05-21           2010            2.85
13   2008-07-16           2008            1.30
14   2008-07-16           2009            2.40
15   2008-07-16           2010            2.75
17   2008-11-19           2008            0.15
18   2008-11-19           2009            0.45
19   2008-11-19           2010            2.75
20   2008-11-19           2011            3.20

EDIT: Removed MultiIndex.from_product - not needed

Upvotes: 2

Nk03
Nk03

Reputation: 14949

Another way via pivot:

k = df1.pivot(*df1).ffill().stack().reset_index(name = 'GPD Growth (%)')
df = k[~(pd.to_datetime(k["meeting_date"]).dt.year.gt(k["forecast_year"]))]

OUTPUT:

   meeting_date  forecast_year  GPD Growth (%)
0    2007-11-20           2007            2.45
1    2007-11-20           2008            2.15
2    2007-11-20           2009            2.50
3    2007-11-20           2010            2.55
5    2008-02-20           2008            1.65
6    2008-02-20           2009            2.40
7    2008-02-20           2010            2.75
9    2008-05-21           2008            0.75
10   2008-05-21           2009            2.40
11   2008-05-21           2010            2.85
13   2008-07-16           2008            1.30
14   2008-07-16           2009            2.40
15   2008-07-16           2010            2.75
17   2008-11-19           2008            0.15
18   2008-11-19           2009            0.45
19   2008-11-19           2010            2.75
20   2008-11-19           2011            3.20

Upvotes: 3

Related Questions