Reputation: 469
I have data similar to:
dt <-data.table(V1=c(-1,NA,10),V2=c(-2,3,15),V3=c(NA,5,20),V4=c(NA,NA,NA),V5=c(-3,7,NA),
avg=c(-2,5,15),stdev=c(.82,1.63,4.08))
Which looks like:
V1 V2 V3 V4 V5 avg stdev
1 -1 -2 NA NA -3 -2 0.82
2 NA 3 5 NA 7 5 1.63
3 10 15 20 NA NA 15 4.08
I want to replace each NA with a normally distributed random number using the avg
column for the mean and the stdev
column for the standard deviation. For example, in row 1, I want two random numbers (to replace NA's in V3
,V4
) calculated using mean=-2 and sd=0.82. In row 2, I want two random numbers (to replace NA's in V1
,V4
) calculated using mean=5 and sd=1.63, etc. Same logic for all rows. I tried this and it didn't work as expected:
for (col in colnames(dt)) dt[is.na(get(col)), (col) := rnorm(1,mean=(avg),sd=(stdev))]
I welcome any ideas. Thanks.
Upvotes: 1
Views: 395
Reputation: 1798
You are very close.
library(data.table)
dt <-data.table(V1=c(-1,NA,10),V2=c(-2,3,15),V3=c(NA,5,20),V4=c(NA,NA,NA),V5=c(-3,7,NA),
avg=c(-2,5,15),stdev=c(.82,1.63,4.08))
# change V4 to numeric, otherwise is logical
dt[, V4 := as.numeric(V4)]
set.seed(1234)
for (col in names(dt)[1:5]){
# rnorm() is more annoying than expected, n = 1 cannot recycle
dt[is.na(get(col)), (col) := as.numeric(rnorm(rep(1, sum(is.na(get(col)))), avg, stdev))]
}
# V1 V2 V3 V4 V5 avg stdev
# 1: -1.000000 -2 -1.772508 -1.110758 -3.00000 -2 0.82
# 2: 3.032483 3 5.000000 1.176513 7.00000 5 1.63
# 3: 10.000000 15 20.000000 16.750829 17.06471 15 4.08
Upvotes: 2
Reputation: 703
I'm going to presume that there isn't anything special about avg
and stdev
columns and that they are just rowwise means and standard deviations.
First a helper imputation function
impute <- function(x) {
avg <- mean(x, na.rm = TRUE)
stdev <- sd(x, na.rm = TRUE)
indices <- which(is.na(x))
x[indices] <- rnorm(length(indices), avg, stdev)
x
}
Next strip out the avg
and stdev
columns.
no_avg <- dt[, !(colnames(dt) %in% c("avg", "stdev"))]
Then apply the impute
rowwise. data.table
and t
just get us back to the original format.
data.table(t(apply(no_avg, 1, impute)))
Upvotes: 1