Reputation: 181
I have a wide data frame with several years:
df = pd.DataFrame(index=pd.Index([29925, 223725, 280165, 813285, 956765], name='ID'),
columns=pd.Index([1991, 1992, 1993, 1994, 1995, 1996, '2010-2012'], name='Year'),
data = np.array([[np.NaN, np.NaN, 16, 17, 18, 19, np.NaN],
[16, 17, 18, 19, 20, 21, np.NaN],
[np.NaN, np.NaN, np.NaN, np.NaN, 16, 17, 31],
[np.NaN, 22, 23, 24, np.NaN, 26, np.NaN],
[36, 36, 37, 38, 39, 40, 55]]))
Year 1991 1992 1993 1994 1995 1996 2010-2012
ID
29925 NaN NaN 16.0 17.0 18.0 19.0 NaN
223725 16.0 17.0 18.0 19.0 20.0 21.0 NaN
280165 NaN NaN NaN NaN 16.0 17.0 31.0
813285 NaN 22.0 23.0 24.0 NaN 26.0 NaN
956765 36.0 36.0 37.0 38.0 39.0 40.0 55.0
The values in each row are the age of each person, with each holding a unique ID. I want to fill the NaN
of this data frame in each year of every row, based on the existing age values in each row.
For example, ID 29925
is 16 in 1993
, we know they are 15 in 1992
and 14 in 1991
, therefore we want to replace the NaN
for 29925
in the columns 1992
and 1991
. Similarly, I want to replace the NaN
in the column2010-2012
based on the existing age values for 29925
. Let's assume that 29925
is 15 years older from 1996
in the 2010-2012
column. What is the fastest way to do this for the whole data frame - i.e for all IDs?
Upvotes: 3
Views: 258
Reputation: 7922
# imports we need later
import numpy as np
import pandas as pd
This is a not a particularly efficient method but it works. I'll leave out your last column, to make things more systematic.
The df
:
df = pd.DataFrame(index=pd.Index([29925, 223725, 280165, 813285, 956765], name='ID'),
columns=pd.Index([1992, 1992, 1993, 1994, 1995, 1996], name='Year'),
data = np.array([[np.NaN, np.NaN, 16, 17, 18, 19],
[16, 17, 18, 19, 20, 21],
[np.NaN, np.NaN, np.NaN, np.NaN, 16, 17],
[np.NaN, 22, 23, 24, np.NaN, 26],
[35, 36, 37, 38, 39, 40]]))
Calculate date of birth for everyone:
dob=[]
for irow, row in enumerate(df.iterrows()):
dob.append(np.asarray([int(each) for each in df.columns]) - np.asarray(df.iloc[irow,:]))
or, if you are into list comprehensions:
dob = [np.asarray([int(each) for each in df.columns]) - np.asarray(df.iloc[irow,:]) for irow, row in enumerate(df.iterrows())]
Now dob
is like this:
[array([ nan, nan, 1977., 1977., 1977., 1977.]),
array([1976., 1975., 1975., 1975., 1975., 1975.]),
array([ nan, nan, nan, nan, 1979., 1979.]),
array([ nan, 1970., 1970., 1970., nan, 1970.]),
array([1956., 1956., 1956., 1956., 1956., 1956.])]
Make a simpler dob list using np.unique, remove nans:
dob_filtered=[np.unique(each[~np.isnan(each)])[0] for each in dob]
dob_filtered
now looks like this:
[1977.0, 1975.0, 1979.0, 1970.0, 1956.0]
Attach this list to dataframe:
df['dob']=dob_filtered
Fill in the NaN
s of the df
using the dob
column:
for irow, row in enumerate(df.index):
for icol, col in enumerate(df.columns[:-2]):
df.loc[row,col] = col - df['dob'][row]
Delete the dob
column (just to obtain the original columns only, otherwise not important):
df.drop(['dob'],axis=1)
Obtaining:
Year 1992 1992 1993 1994 1995 1996
ID
29925 15.0 15.0 16.0 17.0 18.0 19.0
223725 17.0 17.0 18.0 19.0 20.0 21.0
280165 13.0 13.0 14.0 15.0 16.0 17.0
813285 22.0 22.0 23.0 24.0 25.0 26.0
956765 36.0 36.0 37.0 38.0 39.0 40.0
ie
Upvotes: 2