Rolling mean with varying window length in Python

Question

I am working with NLSY79 data and I am trying to construct a 'smoothed' income variable that averages over a period of 4 years. Between 1979 and 1994, the NLSY conducted surveys annually, while after 1996 the survey was conducted biennially. This means that my smoothed income variable will average four observations prior to 1994 and only two after 1996.

I would like my smoothed income variable to satisfy the following criteria:

1) It should be an average of 4 income observations from 1979 to 1994 and only 2 from 1996 onward

2) The window should START from a given observation rather than be centered at it. Therefore, my smoothed income variable should tell me the average income over the four years starting from that date

3) It should ignore NaNs

It should, therefore, look like the following (note that I only computed values for 'smoothed income' that could be computed with the data I have provided.)

id year  income 'smoothed income'

1  1979  20,000  21,250  
1  1980  22,000  
1  1981  21,000
1  1982  22,000
...
1  2014  34,000   34,500
1  2016  35,000   
2  1979  28,000   28,333
2  1980  NaN
2  1981  28,000
2  1982  29,000

I am relatively new to dataframe manipulation with pandas, so here is what I have tried:

smooth = DATA.groupby('id')['income'].rolling(window=4, min_periods=1).mean()
DATA['smoothIncome'] =  smooth.reset_index(level=0, drop=True)

This code accounts for NaNs, but otherwise does not accomplish objectives 2) and 3).

Any help would be much appreciated

EB727 · Accepted Answer

Ok, I've modified the code provided by ansev to make it work. filling in NaNs was causing the problems.

Here's the modified code:

df.set_index('year').groupby('id').income.apply(lambda x: x.reindex(range(x.index.min(),x.index.max()+1))
                                                           .rolling(4, min_periods = 1).mean().shift(-3)).reset_index()

The only problem I have now is that the mean is not calculated when there are fewer that 4 years remaining (e.g. from 2014 onward, because my data goes until 2016). Is there a way of shortening the window length after 2014?

Rolling mean with varying window length in Python

Answers (1)

Related Questions