zhang
zhang

Reputation: 543

data.table fill NA by custom function and other cells

Assume we have a data.table like:

library(data.table)

set.seed(123666)


dt <- data.table(
  id = seq(1, 5), 
  sample1 = c(sample(c(NA, runif(2))), NA),      
  sample2 = c(NA, sample(c(NA, runif(3)))),    
  sample3 = c(sample(c(NA, runif(4))))      
)
dt
   id   sample1   sample2   sample3
1:  1        NA        NA 0.6387276
2:  2 0.9293370 0.1875354 0.2087892
3:  3 0.1528115        NA 0.7849779
4:  4        NA 0.6875024 0.3684756
5:  5        NA 0.4859773        NA

Its have many NA values, now, we want to fill it, typically, we can use following syntax to do

dt[is.na(dt)] <- 0
dt
   id   sample1   sample2   sample3
1:  1 0.0000000 0.0000000 0.6387276
2:  2 0.9293370 0.1875354 0.2087892
3:  3 0.1528115 0.0000000 0.7849779
4:  4 0.0000000 0.6875024 0.3684756
5:  5 0.0000000 0.4859773 0.0000000

However, if we want to fill NA with more complex rule, a custom function for example, calc_data(), to calc NA. This function need two input, just a example here, first is the id value of NA value, secound is the colname or colname index of the cell.

# example, not real function
sample_value <- c(1, 3, 3)
names(sample_value) <- c('sample1', 'sample2', 'sample3')
calc_data <- function(sample, id) {
    na_calc <- id * 3 + sample_value[sample]
}

Now, it is possible to fill NA with this coustom function with data.table syntax. how to put its required value to calc_data

Upvotes: 3

Views: 63

Answers (2)

r2evans
r2evans

Reputation: 160407

An alternative that isn't as "efficient" as set (which is the fastest data.table-canonical way to work on its data), we can do

dt
#       id sample1 sample2 sample3
#    <int>   <num>   <num>   <num>
# 1:     1      NA      NA   0.639
# 2:     2   0.929   0.188   0.209
# 3:     3   0.153      NA   0.785
# 4:     4      NA   0.688   0.368
# 5:     5      NA   0.486      NA

sample_value <- c(1, 3, 3)
names(sample_value) <- c('sample1', 'sample2', 'sample3')
fun <- function(x, samp) fcoalesce(x, calc_data(samp, id))
dt[, names(sample_value) := Map(fun, .SD, sample_value),
   .SDcols = names(sample_value)]
#       id sample1 sample2 sample3
#    <int>   <num>   <num>   <num>
# 1:     1   4.000   6.000   0.639
# 2:     2   0.929   0.188   0.209
# 3:     3   0.153  12.000   0.785
# 4:     4  13.000   0.688   0.368
# 5:     5  16.000   0.486  18.000
  • Map is similar to lapply (call a FUNction once for each element in the list X), but while lapply can operate only on one list, Map "zips" multiple lists/vectors together. In this case, if we were to unroll the Map into a sequence of expressions, we'd effectively get:

    fun(sample1, 1)
    fun(sample2, 3)
    fun(sample3, 3)
    

    (I used fun as a predefined function for clarity and breakout, though it could easily be used as an anonymous function in the call to Map, as in

    Map(function(x, samp) fcoalesce(x, calc_data(samp, id)), .SD, sample_value)
    
  • fcoalesce returns (element-wise along vectors) the first non-NA value found. It takes one or more vectors (I'm using two above, three here for demo). For example,

    fcoalesce(c(1,2,NA), c(3,NA,4), c(NA,5,6))
    # [1] 1 2 4
    

    The 1,2 from the first argument are retained. The third position is NA, so it shifts to the third position in the second argument and finds 4, so it returns that as the third. In this case, nothing in the third argument c(NA,5,6) is needed.

    Using fcoalesce here is effectively the same as

    fifelse(is.na(x), calc_data(samp, id), x)
    

    which can also be used in the function if you prefer.

  • I'm (ab)using the names of sample_value to iterate (in .SDcols) and save (as names(sample_values) :=).

Upvotes: 2

Wimpel
Wimpel

Reputation: 27732

perhaps something like this could work

for(j in 2:4) set(dt, 
                  i = which(is.na(dt[[j]])), 
                  j = j, 
                  value = calc_data(j - 1, dt[which(is.na(dt[[j]])), "id"]))

output

   id    sample1    sample2    sample3
1:  1  0.8055845 6.00000000  0.2030456
2:  2  0.5705721 9.00000000  0.7954992
3:  3 10.0000000 0.09605308 12.0000000
4:  4 13.0000000 0.25545666  0.6506906
5:  5  0.8055845 0.51889032  0.8931946

Upvotes: 0

Related Questions