bziggy

Reputation: 463

R data.table creating a custom function using lapply to create and reassign multiple variables

I have the following lines of code:

DT[flag==T, temp:=haz_1.5]
DT[, temp:= na.locf(temp, na.rm = FALSE), "pid"]
DT[agedays==61, haz_1.5_1:=temp]

I need to convert this into a function so that it works on a list of variables instead of just one. I have recently learned how to write a function using lapply that takes a list of columns and conditions and creates one set of new columns. However, I'm unsure how to do this when I need to pass a list of columns and also carry values forward within those columns.

For instance, I can code the following:

  columns <- c("haz_1.5", "waz_1.5")
  new_cols <- paste(columns, "1", sep = "_")
  x <- 61
  maled_anthro[(flag == TRUE) & (agedays == x), (new_cols) := lapply(.SD, function(y) na.locf(y, na.rm = FALSE)), .SDcols = columns]

But I am missing the grouped na.locf step, so I am not getting the same output as the original lines of code. How would I incorporate the line that uses na.locf to carry values forward within each pid (DT[, temp:= na.locf(temp, na.rm = FALSE), "pid"]) so that everything is wrapped up in a single function? Would this work with lapply in the same manner?

Dummy data that's similar to the data table I'm using:

DT <- data.table(pid  = c(1,1,2,3,3,4,4,5,5,5),
                 flag = c(T,T,F,T,T,F,T,T,T,T),
                 agedays = c(1,61,61,51,61,23,61,1,32,61),
                 haz_1.5 = c(1,1,1,2,NA,1,3,2,3,4),
                 waz_1.5 = c(1,NA,NA,NA,NA,2,2,3,4,4))

Upvotes: 2

Views: 2059

Answers (1)

Uwe

Reputation: 42564

OP's code can be turned into an anonymous function which is applied to the selected columns:

library(data.table)
columns <- c("haz_1.5", "waz_1.5")
new_cols <- paste0(columns, "_1")
x <-  61

DT[, (new_cols) := lapply(.SD, function(v) {
  temp <- fifelse(flag, v, NA_real_)
  temp <- nafill(temp, "locf")
  fifelse(agedays == x, temp, NA_real_)
}), .SDcols = columns, by = pid][]
    pid  flag agedays haz_1.5 waz_1.5 haz_1.5_1 waz_1.5_1
 1:   1  TRUE       1       1       1        NA        NA
 2:   1  TRUE      61       1      NA         1         1
 3:   2 FALSE      61       1      NA        NA        NA
 4:   3  TRUE      51       2      NA        NA        NA
 5:   3  TRUE      61      NA      NA         2        NA
 6:   4 FALSE      23       1       2        NA        NA
 7:   4  TRUE      61       3       2         3         2
 8:   5  TRUE       1       2       3        NA        NA
 9:   5  TRUE      32       3       4        NA        NA
10:   5  TRUE      61       4       4         4         4

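This is the same result we get when we manually repeat OP's code for the two columns. Note that the temp column must be removed before it is reused: the assignment DT[(flag), temp := ...] only overwrites rows where flag is TRUE, so values from the previous column would otherwise remain in the other rows.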

DT[(flag), temp := haz_1.5]
DT[, temp := zoo::na.locf(temp, na.rm = FALSE), by = pid]
DT[agedays == 61, haz_1.5_1 := temp]
DT[, temp := NULL]
DT[(flag), temp := waz_1.5]
DT[, temp := zoo::na.locf(temp, na.rm = FALSE), by = pid]
DT[agedays == 61, waz_1.5_1 := temp]
DT[, temp := NULL][]
    pid  flag agedays haz_1.5 waz_1.5 haz_1.5_1 waz_1.5_1
 1:   1  TRUE       1       1       1        NA        NA
 2:   1  TRUE      61       1      NA         1         1
 3:   2 FALSE      61       1      NA        NA        NA
 4:   3  TRUE      51       2      NA        NA        NA
 5:   3  TRUE      61      NA      NA         2        NA
 6:   4 FALSE      23       1       2        NA        NA
 7:   4  TRUE      61       3       2         3         2
 8:   5  TRUE       1       2       3        NA        NA
 9:   5  TRUE      32       3       4        NA        NA
10:   5  TRUE      61       4       4         4         4
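Since the question asks for a reusable function, the anonymous function above could be wrapped in a named helper. The following is only a minimal sketch: carry_forward_at() and its arguments (dt, cols, age, suffix) are hypothetical names introduced here for illustration, and the sketch assumes the table has columns named pid, flag and agedays, that the selected columns are of type double, and that no column is itself called age.

```r
library(data.table)

# Hypothetical helper (not part of data.table): for each column in `cols`,
# keep the values where flag is TRUE, carry them forward within each pid,
# and store the value found at agedays == age in a new column "<col>_<suffix>".
# Assumes dt has columns pid, flag, agedays and that `cols` are double columns.
carry_forward_at <- function(dt, cols, age, suffix = "1") {
  new_cols <- paste(cols, suffix, sep = "_")
  dt[, (new_cols) := lapply(.SD, function(v) {
    temp <- fifelse(flag, v, NA_real_)       # blank out rows where flag is FALSE
    temp <- nafill(temp, type = "locf")      # last observation carried forward
    fifelse(agedays == age, temp, NA_real_)  # keep only the rows at the target age
  }), .SDcols = cols, by = pid]
  invisible(dt)  # dt has been modified by reference
}

# Dummy data from the question
DT <- data.table(pid     = c(1, 1, 2, 3, 3, 4, 4, 5, 5, 5),
                 flag    = c(TRUE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE),
                 agedays = c(1, 61, 61, 51, 61, 23, 61, 1, 32, 61),
                 haz_1.5 = c(1, 1, 1, 2, NA, 1, 3, 2, 3, 4),
                 waz_1.5 = c(1, NA, NA, NA, NA, 2, 2, 3, 4, 4))

carry_forward_at(DT, c("haz_1.5", "waz_1.5"), age = 61)
```

Because := assigns by reference, DT itself gains the columns haz_1.5_1 and waz_1.5_1; printing DT afterwards should show the same values as in the tables above.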

Some explanations

  • There is one important difference between OP's "single column" code and this approach: the anonymous function is called once for each value of the grouping variable pid. In OP's code, the first and last assignments work on the ungrouped (full) vectors, which might be somewhat more efficient. However, those two assignments operate row by row and do not depend on pid, so the result is the same either way.
  • Instead of zoo::na.locf(), data.table's nafill() function is used (new in data.table v1.12.4, released on CRAN on 3 Oct 2019). Note that nafill() is documented for numeric columns.
  • DT[(flag), ...] is equivalent to DT[flag == TRUE, ...].
  • When fifelse() is used instead of a subsetted assignment by reference, the no argument must be NA (here NA_real_, matching the type of the numeric columns) to give the same result. Thus, as long as temp does not already exist, DT[, temp := fifelse(flag, haz_1.5, NA_real_)][] is equivalent to DT[(flag), temp := haz_1.5][].
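To illustrate the first point, here is a sketch of a variant in which only the carry-forward step is grouped by pid, while the first and last steps run on the full columns, as in OP's original code. It reuses the dummy data from the question and is an illustration under those assumptions rather than a benchmarked recommendation; whether it is actually faster would need to be measured on the real data.

```r
library(data.table)

# Dummy data from the question
DT <- data.table(pid     = c(1, 1, 2, 3, 3, 4, 4, 5, 5, 5),
                 flag    = c(TRUE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE),
                 agedays = c(1, 61, 61, 51, 61, 23, 61, 1, 32, 61),
                 haz_1.5 = c(1, 1, 1, 2, NA, 1, 3, 2, 3, 4),
                 waz_1.5 = c(1, NA, NA, NA, NA, 2, 2, 3, 4, 4))

columns  <- c("haz_1.5", "waz_1.5")
new_cols <- paste0(columns, "_1")
x <- 61

# Step 1 (ungrouped): copy the values, blanking rows where flag is FALSE
DT[, (new_cols) := lapply(.SD, function(v) fifelse(flag, v, NA_real_)),
   .SDcols = columns]

# Step 2 (grouped): carry the last observation forward within each pid
DT[, (new_cols) := lapply(.SD, nafill, type = "locf"),
   .SDcols = new_cols, by = pid]

# Step 3 (ungrouped): keep the carried-forward value only where agedays == x
DT[agedays != x, (new_cols) := NA_real_]
```

The new columns are used as working storage here, so no separate temp column has to be created and removed.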

Upvotes: 3
