Reputation: 63
I am working with NLSY79 data and I am trying to construct a 'smoothed' income variable that averages over a period of 4 years. Between 1979 and 1994, the NLSY conducted surveys annually, while after 1996 the survey was conducted biennially. This means that my smoothed income variable will average four observations prior to 1994 and only two after 1996.
I would like my smoothed income variable to satisfy the following criteria:
1) It should be an average of 4 income observations from 1979 to 1994 and only 2 from 1996 onward
2) The window should START from a given observation rather than be centered at it. Therefore, my smoothed income variable should tell me the average income over the four years starting from that date
3) It should ignore NaNs
It should, therefore, look like the following (note that I only computed values for 'smoothed income' that could be computed with the data I have provided.)
id year income 'smoothed income'
1 1979 20,000 21,250
1 1980 22,000
1 1981 21,000
1 1982 22,000
...
1 2014 34,000 34,500
1 2016 35,000
2 1979 28,000 28,333
2 1980 NaN
2 1981 28,000
2 1982 29,000
I am relatively new to dataframe manipulation with pandas, so here is what I have tried:
smooth = DATA.groupby('id')['income'].rolling(window=4, min_periods=1).mean()
DATA['smoothIncome'] = smooth.reset_index(level=0, drop=True)
This code accounts for NaNs, but otherwise does not accomplish objectives 2) and 3).
Any help would be much appreciated
Upvotes: 0
Views: 431
Reputation: 63
Ok, I've modified the code provided by ansev to make it work. filling in NaNs was causing the problems.
Here's the modified code:
df.set_index('year').groupby('id').income.apply(lambda x: x.reindex(range(x.index.min(),x.index.max()+1))
.rolling(4, min_periods = 1).mean().shift(-3)).reset_index()
The only problem I have now is that the mean is not calculated when there are fewer that 4 years remaining (e.g. from 2014 onward, because my data goes until 2016). Is there a way of shortening the window length after 2014?
Upvotes: 0