Pandas DataFrame - Fill NaNs of columns based on values of other columns

Question

I have a wide data frame with several years:

df = pd.DataFrame(index=pd.Index([29925, 223725, 280165, 813285, 956765], name='ID'),
                  columns=pd.Index([1991, 1992, 1993, 1994, 1995, 1996, '2010-2012'], name='Year'),
                  data = np.array([[np.NaN, np.NaN, 16, 17, 18, 19, np.NaN],
                                   [16, 17, 18, 19, 20, 21, np.NaN],
                                   [np.NaN, np.NaN, np.NaN, np.NaN, 16, 17, 31],
                                   [np.NaN, 22, 23, 24, np.NaN, 26, np.NaN],
                                   [36, 36, 37, 38, 39, 40, 55]]))

Year     1991  1992  1993  1994  1995  1996  2010-2012
ID                                                    
29925     NaN   NaN  16.0  17.0  18.0  19.0        NaN
223725   16.0  17.0  18.0  19.0  20.0  21.0        NaN
280165    NaN   NaN   NaN   NaN  16.0  17.0       31.0
813285    NaN  22.0  23.0  24.0   NaN  26.0        NaN
956765   36.0  36.0  37.0  38.0  39.0  40.0       55.0

The values in each row are the age of each person, with each holding a unique ID. I want to fill the NaN of this data frame in each year of every row, based on the existing age values in each row.

For example, ID 29925 is 16 in 1993, we know they are 15 in 1992 and 14 in 1991, therefore we want to replace the NaN for 29925 in the columns 1992 and 1991. Similarly, I want to replace the NaN in the column2010-2012 based on the existing age values for 29925. Let's assume that 29925 is 15 years older from 1996 in the 2010-2012 column. What is the fastest way to do this for the whole data frame - i.e for all IDs?

zabop · Accepted Answer

# imports we need later
import numpy as np
import pandas as pd

This is a not a particularly efficient method but it works. I'll leave out your last column, to make things more systematic.

The df:

df = pd.DataFrame(index=pd.Index([29925, 223725, 280165, 813285, 956765], name='ID'),
                  columns=pd.Index([1992, 1992, 1993, 1994, 1995, 1996], name='Year'),
                  data = np.array([[np.NaN, np.NaN, 16, 17, 18, 19],
                                   [16, 17, 18, 19, 20, 21],
                                   [np.NaN, np.NaN, np.NaN, np.NaN, 16, 17],
                                   [np.NaN, 22, 23, 24, np.NaN, 26],
                                   [35, 36, 37, 38, 39, 40]]))

Calculate date of birth for everyone:

dob=[]
for irow, row in enumerate(df.iterrows()):
    dob.append(np.asarray([int(each) for each in df.columns]) - np.asarray(df.iloc[irow,:]))

or, if you are into list comprehensions:

dob = [np.asarray([int(each) for each in df.columns]) - np.asarray(df.iloc[irow,:]) for irow, row in enumerate(df.iterrows())]

Now dob is like this:

[array([  nan,   nan, 1977., 1977., 1977., 1977.]),
 array([1976., 1975., 1975., 1975., 1975., 1975.]),
 array([  nan,   nan,   nan,   nan, 1979., 1979.]),
 array([  nan, 1970., 1970., 1970.,   nan, 1970.]),
 array([1956., 1956., 1956., 1956., 1956., 1956.])]

Make a simpler dob list using np.unique, remove nans:

dob_filtered=[np.unique(each[~np.isnan(each)])[0] for each in dob]

dob_filtered now looks like this:

[1977.0, 1975.0, 1979.0, 1970.0, 1956.0]

Attach this list to dataframe:

df['dob']=dob_filtered

Fill in the NaNs of the df using the dob column:

for irow, row in enumerate(df.index):
    for icol, col in enumerate(df.columns[:-2]):
        df.loc[row,col] = col - df['dob'][row]

Delete the dob column (just to obtain the original columns only, otherwise not important):

df.drop(['dob'],axis=1)

Obtaining:

Year    1992    1992    1993    1994    1995    1996
ID                      
29925   15.0    15.0    16.0    17.0    18.0    19.0
223725  17.0    17.0    18.0    19.0    20.0    21.0
280165  13.0    13.0    14.0    15.0    16.0    17.0
813285  22.0    22.0    23.0    24.0    25.0    26.0
956765  36.0    36.0    37.0    38.0    39.0    40.0

ie

Pandas DataFrame - Fill NaNs of columns based on values of other columns

Answers (1)

Related Questions