Reputation: 543
Assume we have a data.table like this:
library(data.table)
set.seed(123666)
dt <- data.table(
  id = seq(1, 5),
  sample1 = c(sample(c(NA, runif(2))), NA),
  sample2 = c(NA, sample(c(NA, runif(3)))),
  sample3 = c(sample(c(NA, runif(4))))
)
dt
id sample1 sample2 sample3
1: 1 NA NA 0.6387276
2: 2 0.9293370 0.1875354 0.2087892
3: 3 0.1528115 NA 0.7849779
4: 4 NA 0.6875024 0.3684756
5: 5 NA 0.4859773 NA
It has many NA values. Typically, we can fill them with the following syntax:
dt[is.na(dt)] <- 0
dt
id sample1 sample2 sample3
1: 1 0.0000000 0.0000000 0.6387276
2: 2 0.9293370 0.1875354 0.2087892
3: 3 0.1528115 0.0000000 0.7849779
4: 4 0.0000000 0.6875024 0.3684756
5: 5 0.0000000 0.4859773 0.0000000
However, suppose we want to fill the NA values with a more complex rule, for example a custom function calc_data() that computes the replacement for each NA. This function needs two inputs (just an example here): the id value of the row containing the NA, and the column name or column index of the cell.
# example, not real function
sample_value <- c(1, 3, 3)
names(sample_value) <- c('sample1', 'sample2', 'sample3')
calc_data <- function(sample, id) {
  id * 3 + sample_value[sample]
}
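For instance, with this toy function, the NA in sample1 at id = 4 should be replaced by 4 * 3 + 1 = 13:
calc_data('sample1', 4)
# sample1
#      13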
Is it possible to fill the NA values with this custom function using data.table syntax? How do I pass the required values to calc_data()?
Upvotes: 3
Views: 63
Reputation: 160407
As an alternative that isn't as "efficient" as set (which is the fastest, data.table-canonical way to work on its data), we can do:
dt
# id sample1 sample2 sample3
# <int> <num> <num> <num>
# 1: 1 NA NA 0.639
# 2: 2 0.929 0.188 0.209
# 3: 3 0.153 NA 0.785
# 4: 4 NA 0.688 0.368
# 5: 5 NA 0.486 NA
sample_value <- c(1, 3, 3)
names(sample_value) <- c('sample1', 'sample2', 'sample3')
fun <- function(x, samp, id) fcoalesce(x, calc_data(samp, id))
dt[, names(sample_value) := Map(fun, .SD, sample_value, MoreArgs = list(id = id)),
   .SDcols = names(sample_value)]
# id sample1 sample2 sample3
# <int> <num> <num> <num>
# 1: 1 4.000 6.000 0.639
# 2: 2 0.929 0.188 0.209
# 3: 3 0.153 12.000 0.785
# 4: 4 13.000 0.688 0.368
# 5: 5 16.000 0.486 18.000
Map is similar to lapply (call a FUNction once for each element in the list X), but while lapply can operate only on one list, Map "zips" multiple lists/vectors together. In this case, if we were to unroll the Map into a sequence of expressions, we'd effectively get:
fun(sample1, 1, id)
fun(sample2, 3, id)
fun(sample3, 3, id)
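As a standalone illustration of the difference, with toy vectors unrelated to dt:
lapply(list(1, 2), function(x) x * 10)            # one list in: list(10, 20)
Map(function(x, y) x * 10 + y, c(1, 2), c(7, 8))  # elements paired positionally: list(17, 28)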
I used fun as a predefined function for clarity and breakout, though it could just as easily be an anonymous function in the call to Map; because the anonymous function is defined inside j, it can reference the id column directly, as in
Map(function(x, samp) fcoalesce(x, calc_data(samp, id)), .SD, sample_value)
fcoalesce returns (element-wise along vectors) the first non-NA value found. It takes one or more vectors (I'm using two above, three here for demo). For example,
fcoalesce(c(1,2,NA), c(3,NA,4), c(NA,5,6))
# [1] 1 2 4
The 1 and 2 from the first argument are retained. The third position is NA, so it shifts to the third position of the second argument and finds 4, which it returns as the third element. In this case, nothing in the third argument c(NA,5,6) is needed.
Using fcoalesce here is effectively the same as
fifelse(is.na(x), calc_data(samp, id), x)
which can also be used in the function if you prefer.
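A sketch of that variant, using a hypothetical fun2 that takes the same arguments as fun above:
fun2 <- function(x, samp, id) fifelse(is.na(x), calc_data(samp, id), x)
dt[, names(sample_value) := Map(fun2, .SD, sample_value, MoreArgs = list(id = id)),
   .SDcols = names(sample_value)]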
I'm (ab)using the names of sample_value to iterate (in .SDcols) and save (as names(sample_value) :=).
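That is, the same character vector drives both the columns read (via .SDcols) and the columns written (via :=):
names(sample_value)
# [1] "sample1" "sample2" "sample3"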
Upvotes: 2
Reputation: 27732
Perhaps something like this could work:
for (j in 2:4) {
  # rows where column j is NA; fill them via calc_data() using the matching ids
  idx <- which(is.na(dt[[j]]))
  set(dt, i = idx, j = j, value = calc_data(j - 1, dt$id[idx]))
}
Output:
id sample1 sample2 sample3
1: 1 0.8055845 6.00000000 0.2030456
2: 2 0.5705721 9.00000000 0.7954992
3: 3 10.0000000 0.09605308 12.0000000
4: 4 13.0000000 0.25545666 0.6506906
5: 5 0.8055845 0.51889032 0.8931946
Upvotes: 0