MI MA
MI MA

Reputation: 181

Pandas DataFrame Filling missing values in a column

I have a large DataFrame with the following columns:

import pandas as pd 

x = pd.read_csv('age_year.csv')
x.head()

ID  Year    age
22445   1991    
29925   1991    
76165   1991    
223725  1991    16.0
280165  1991    

The Year column has values ranging from 1991 to 2017. Most ID have an age value in each Year, for example:

x.loc[x['ID'] == 280165].to_clipboard(index = False)

ID  Year    age
280165  1991    
280165  1992    
280165  1993    
280165  1994    
280165  1995    16.0
280165  1996    17.0
280165  1997    18.0
280165  1998    19.0
280165  1999    20.0
280165  2000    21.0
280165  2001    
280165  2002    
280165  2003    
280165  2004    25.0
280165  2005    26.0
280165  2006    27.0
280165  2007    
280165  2008    
280165  2010    31.0
280165  2011    32.0
280165  2012    33.0
280165  2013    34.0
280165  2014    35.0
280165  2015    36.0
280165  2016    37.0
280165  2017    38.0

I want to fill the missing values in the age column for each unique ID based on their existing values. For example, for ID 280165 above, we know they are 29 in 2008, given that they are 31 in 2010 (28 in 2007, 24 in 2003 and so on).

How should one fill in these missing age values for many unique ID for every year? I'm not sure how to do this in a uniform way across the entire DataFrame. The data used as the example in this question can be found here.

Upvotes: 7

Views: 815

Answers (2)

Henry Yik
Henry Yik

Reputation: 22493

I think instead of trying to fill the values, find the year of birth instead.

df["age"] =  df["Year"] - (df["Year"]-df["age"]).mean()

Or general solution with more than 1 id:

s = df.loc[df["age"].notnull()].groupby("ID").first()

df["age"] = df["Year"]-df["ID"].map(s["Year"]-s["age"])

print (df)

        ID  Year   age
0   280165  1991  12.0
1   280165  1992  13.0
2   280165  1993  14.0
3   280165  1994  15.0
4   280165  1995  16.0
5   280165  1996  17.0
6   280165  1997  18.0
7   280165  1998  19.0
8   280165  1999  20.0
9   280165  2000  21.0
10  280165  2001  22.0
11  280165  2002  23.0
12  280165  2003  24.0
13  280165  2004  25.0
14  280165  2005  26.0
15  280165  2006  27.0
16  280165  2007  28.0
17  280165  2008  29.0
18  280165  2010  31.0
19  280165  2011  32.0
20  280165  2012  33.0
21  280165  2013  34.0
22  280165  2014  35.0
23  280165  2015  36.0
24  280165  2016  37.0
25  280165  2017  38.0

Upvotes: 3

Ayoub ZAROU
Ayoub ZAROU

Reputation: 2417

try doing:

def get_age(s):
    present = s.age.notna().idxmax()
    diff = s.loc[[present]].eval('age - Year').iat[0]
    s['age'] = diff + s.Year
    return s

df.groupby(['ID']).apply(get_age)

Upvotes: 5

Related Questions