Reputation: 937
For the following example, I would like to use the value of 1994. The SIC is usually the same across all the years. In rare cases, it could different across time. I have a big panel of 250K observations. Thank you!
Group Num Date SIC
100783 1994-03-28 2621
100783 1995-03-30 NaN
100783 1996-03-28 2621
100783 1997-03-27 2621
100783 1998-03-31 2621
842635 1994-03-17 2911
842635 1995-03-30 NaN
842635 1996-03-20 2911
842635 1997-03-21 2911
842635 1998-03-31 2911
Upvotes: 0
Views: 630
Reputation: 1809
I will extend and provide more guidance on the answer you have been given in the comments by Wen-Ben.
First of all, for this to work you need an ordered DataFrame, based on Group Num
and Date
variables. If you are sure that your DataFrame has already been ordered, you don't need to execute the following command, otherwise a sorting (to be executed`inplace) is really necessary:
df.sort_values(['Group Num', 'Date'], inplace=True)
after your DataFrame is ordered, you can require SIC to be imputed using the nearest previous values, based on the grouping of the variable 'Group Num' (so the nearest previous value will be taken only from its reference group):
df['SIC'] = df.groupby('Group Num')['SIC'].ffill()
the ffill command (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.ffill.html) will propagate last valid observation forward to next valid. Therefore, if there is any group starting with a missing value, no imputation will happen. Given the large file, you have better then also launch another imputation using bfill (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.bfill.html), which uses instead the next valid observation to fill any gap:
df['SIC'] = df.groupby('Group Num')['SIC'].bfill()
That's should be enough to cover any missing occurence and the replacement is done without creating a new DataFrame, but operating on the columns of the existing one.
Upvotes: 1