sdaza

Reputation: 1052

Imputing age based on a sequence of years

I would like to impute age using year information. I have a dataset with the following characteristics:

library(data.table)

dat <- data.table(id = c(rep(1, 8), rep(2, 8)),
                  year = c(2007:2014, 2007:2014),
                  age = c(1, NA, 3, NA, NA, 5, 7, NA, NA, NA, 30, NA, 32, 35, NA, NA),
                  age_imp = c(1, 2, 3, 4, 5, 5, 7, 8, 28, 29, 30, 31, 32, 35, 36, 37)
)


    id year age age_imp
 1:  1 2007   1       1
 2:  1 2008  NA       2
 3:  1 2009   3       3
 4:  1 2010  NA       4
 5:  1 2011  NA       5
 6:  1 2012   5       5
 7:  1 2013   7       7
 8:  1 2014  NA       8
 9:  2 2007  NA      28
10:  2 2008  NA      29
11:  2 2009  30      30
12:  2 2010  NA      31
13:  2 2011  32      32
14:  2 2012  35      35
15:  2 2013  NA      36
16:  2 2014  NA      37

The original variable age doesn't always advance by exactly one per year (e.g., an interview took place less than a year after the previous one, measurement error, etc.), so I want to keep the observed values as they are. For the NA rows, I would like to continue the sequence year by year from the observed values, as in age_imp.

Any suggestions on how to do it?

Upvotes: 0

Views: 91

Answers (2)

sdaza

Reputation: 1052

I finally created this function:

impute.age <- function(age) {
  if (any(is.na(age))) {
    # nothing to anchor on: return NAs of the same length
    if (all(is.na(age))) return(rep(NA_real_, length(age)))
    min.age <- min(age, na.rm = TRUE)
    position <- which(age == min.age)[1] # first occurrence in case of ties
    # initial values: count backwards from the youngest observed age
    # (note: this overwrites any observed values before that position)
    for (i in seq_len(position - 1)) {
      age[position - i] <- age[position] - i
    }
    # remaining missing values: previous value plus one year
    for (i in which(is.na(age))) {
      age[i] <- age[i - 1] + 1
    }
  }
  return(age)
}
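
As a rough usage sketch, the function can be applied within each id using data.table's by. The snippet assumes one row per year and rows sorted by year within id, since the function counts by position and never looks at the year column. The column name age_imp2 is made up for illustration, and the function body is a condensed copy so the snippet runs on its own.

```r
library(data.table)

# Example data from the question
dat <- data.table(
  id = rep(1:2, each = 8),
  year = rep(2007:2014, 2),
  age = c(1, NA, 3, NA, NA, 5, 7, NA, NA, NA, 30, NA, 32, 35, NA, NA),
  age_imp = c(1, 2, 3, 4, 5, 5, 7, 8, 28, 29, 30, 31, 32, 35, 36, 37)
)

# Condensed copy of impute.age() so the snippet is self-contained
impute.age <- function(age) {
  if (all(is.na(age))) return(rep(NA_real_, length(age)))
  position <- which(age == min(age, na.rm = TRUE))[1]
  # count backwards from the youngest observed age
  for (i in seq_len(position - 1)) age[position - i] <- age[position] - i
  # fill each remaining NA with the previous value plus one
  for (i in which(is.na(age))) age[i] <- age[i - 1] + 1
  age
}

# Sort first: the function relies on row order, not on the year column
setorder(dat, id, year)

# age_imp2 is a hypothetical column name
dat[, age_imp2 := impute.age(age), by = id]

# Compare with the expected column from the question
dat[, all.equal(age_imp, age_imp2)]
```

If some years are missing for an id, the positional counting would no longer line up with the calendar, so the rows would need to be completed first.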

Upvotes: 0

chinsoon12

Reputation: 25225

First, use the first non-NA age within each id to form a linear equation with slope 1, then interpolate and extrapolate linearly along it, ignoring jumps for now.

Then, identify where the jumps/steps in age are for each id.

Finally, interpolate and extrapolate again within each group (i.e., each combination of id and jump counter), which takes the jumps into account.

More explanation inline:

#ensure order is correct before using shift
setorder(dat, id, year)

#' Fill NA by interpolating and extrapolating using a known point
#' 
#' @param dt - data.table
#' @param years - the xout that are required
#' 
#' @return a numeric vector of ages given the years
#' 
extrapolate <- function(dt, years) {
    #find the first non NA entry
    firstnonNA <- head(dt[!is.na(age)], 1)

    #using linear equation y - y_1 = 1 * (x - x_1)
    as.numeric(sapply(years, function(x) (x - firstnonNA$year) + firstnonNA$age))
}

#interp and extrap age for years that are missing age assuming linearity without jumps
dat[, imp1 := extrapolate(.SD, year), by="id"]

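#identify where the observed age departs from the no-jump line,
#or from the previous observed age plus one (shift() returns the previous row by default)
dat[, jump:=cumsum(
        (!is.na(age) & imp1!=age) |
        (!is.na(age) & !is.na(shift(age)) & age!=(shift(age)+1))
    ), by="id"]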

#interp and extrap age for years taking into account jumps
dat[, age_imp1 := extrapolate(.SD, year), by=c("id","jump")]

#drop the helper columns and print the results
dat[,c("imp1","jump"):=NULL][]

#check if the results are identical as requested
dat[, identical(age_imp, age_imp1)]

Upvotes: 1
