Reputation: 181
I have a large DataFrame with the following columns:
import pandas as pd
x = pd.read_csv('age_year.csv')
x.head()
ID Year age
22445 1991
29925 1991
76165 1991
223725 1991 16.0
280165 1991
The Year
column has values ranging from 1991
to 2017
. Most ID
have an age
value in each Year
, for example:
x.loc[x['ID'] == 280165].to_clipboard(index = False)
ID Year age
280165 1991
280165 1992
280165 1993
280165 1994
280165 1995 16.0
280165 1996 17.0
280165 1997 18.0
280165 1998 19.0
280165 1999 20.0
280165 2000 21.0
280165 2001
280165 2002
280165 2003
280165 2004 25.0
280165 2005 26.0
280165 2006 27.0
280165 2007
280165 2008
280165 2010 31.0
280165 2011 32.0
280165 2012 33.0
280165 2013 34.0
280165 2014 35.0
280165 2015 36.0
280165 2016 37.0
280165 2017 38.0
I want to fill the missing values in the age
column for each unique ID
based on their existing values. For example, for ID
280165 above, we know they are 29
in 2008
, given that they are 31
in 2010
(28 in 2007, 24 in 2003 and so on).
How should one fill in these missing age
values for many unique ID
for every year? I'm not sure how to do this in a uniform way across the entire DataFrame. The data used as the example in this question can be found here.
Upvotes: 7
Views: 815
Reputation: 22493
I think instead of trying to fill the values, find the year of birth instead.
df["age"] = df["Year"] - (df["Year"]-df["age"]).mean()
Or general solution with more than 1 id:
s = df.loc[df["age"].notnull()].groupby("ID").first()
df["age"] = df["Year"]-df["ID"].map(s["Year"]-s["age"])
print (df)
ID Year age
0 280165 1991 12.0
1 280165 1992 13.0
2 280165 1993 14.0
3 280165 1994 15.0
4 280165 1995 16.0
5 280165 1996 17.0
6 280165 1997 18.0
7 280165 1998 19.0
8 280165 1999 20.0
9 280165 2000 21.0
10 280165 2001 22.0
11 280165 2002 23.0
12 280165 2003 24.0
13 280165 2004 25.0
14 280165 2005 26.0
15 280165 2006 27.0
16 280165 2007 28.0
17 280165 2008 29.0
18 280165 2010 31.0
19 280165 2011 32.0
20 280165 2012 33.0
21 280165 2013 34.0
22 280165 2014 35.0
23 280165 2015 36.0
24 280165 2016 37.0
25 280165 2017 38.0
Upvotes: 3
Reputation: 2417
try doing:
def get_age(s):
present = s.age.notna().idxmax()
diff = s.loc[[present]].eval('age - Year').iat[0]
s['age'] = diff + s.Year
return s
df.groupby(['ID']).apply(get_age)
Upvotes: 5