How to fill missing values using values of closest neighboring years for a Pandas panel?

Question

For the following example, I would like to use the value of 1994. The SIC is usually the same across all the years. In rare cases, it could different across time. I have a big panel of 250K observations. Thank you!

Group Num   Date    SIC
100783  1994-03-28  2621
100783  1995-03-30  NaN
100783  1996-03-28  2621
100783  1997-03-27  2621
100783  1998-03-31  2621
842635  1994-03-17  2911
842635  1995-03-30  NaN
842635  1996-03-20  2911
842635  1997-03-21  2911
842635  1998-03-31  2911

Luca Massaron · Accepted Answer

I will extend and provide more guidance on the answer you have been given in the comments by Wen-Ben.

First of all, for this to work you need an ordered DataFrame, based on Group Num and Date variables. If you are sure that your DataFrame has already been ordered, you don't need to execute the following command, otherwise a sorting (to be executed`inplace) is really necessary:

df.sort_values(['Group Num', 'Date'], inplace=True)

after your DataFrame is ordered, you can require SIC to be imputed using the nearest previous values, based on the grouping of the variable 'Group Num' (so the nearest previous value will be taken only from its reference group):

df['SIC'] = df.groupby('Group Num')['SIC'].ffill()

the ffill command (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.ffill.html) will propagate last valid observation forward to next valid. Therefore, if there is any group starting with a missing value, no imputation will happen. Given the large file, you have better then also launch another imputation using bfill (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.bfill.html), which uses instead the next valid observation to fill any gap:

df['SIC'] = df.groupby('Group Num')['SIC'].bfill()

That's should be enough to cover any missing occurence and the replacement is done without creating a new DataFrame, but operating on the columns of the existing one.

How to fill missing values using values of closest neighboring years for a Pandas panel?

Answers (1)

Related Questions