Reputation: 11762
I have a data frame in long format and I want to replace missing values by random numbers, but I want to do this group wise with different settings...
library(dplyr)
set.seed(1)
imp_df <-
data.frame(exp=rep(letters[1:3], each=2),
rep=1:2,
mean=1:6,
sd=seq(0,0.5,0.1))
df <-
data.frame(
exp=rep(letters[1:3], each=20),
rep=1:2,
int=rnorm(60,10,5)
)
df[sample(1:60,25,replace=F), 'int'] <- NA
So my data looks like above, in the imp_df
I have the settings for the rnorm
function based on the experiment exp
and the replicate rep
.
My data frame has then some missing values and I want to replace the NA
by the random numbers.
How can I do it using dplyr
or tidyr
?
After the answer from @starja, I found a quick, but maybe slow solution by using rowwise
together with left_join
.
df %>%
left_join(imp_df) %>%
rowwise() %>%
mutate(imp.int=if_else(
is.na(int),
rnorm(1, mean, sd),
int
)) %>%
print(n=60)
Are there other ways to do this?
Since the rowwise
approach is pretty slow and I couldn't get it running within some dplyr code, I used a for loop to go through imp_df
with the imputation settings.
This is a pretty quick solution, but not as readable as I was hoping:
df$imp.int <- df$int
for(line in 1:nrow(imp_df)) {
imp_settings <- as.list(imp_df[line,])
rows_missing_values <- which(
df$exp == imp_settings$exp &
df$rep == imp_settings$rep &
is.na(df$imp.int)
)
df$imp.int[rows_missing_values] <-
stats::rnorm(length(rows_missing_values), imp_settings$mean, imp_settings$sd)
}
So we first add a column imp.int
for the imputed values and run now line by line the different imputation settings by replacing the NAs for each group.
Upvotes: 0
Views: 635
Reputation: 21938
This could also be done:
library(dplyr)
library(purrr)
df %>%
left_join(imp_df, by = c("exp", "rep")) %>%
mutate(int = ifelse(is.na(int),
map2(mean, sd, ~ rnorm(1, .x, .y)), int))
exp rep int mean sd
1 a 1 1 1 0.0
2 a 2 10.91822 2 0.1
3 a 1 5.821857 1 0.0
4 a 2 17.9764 2 0.1
5 a 1 11.64754 1 0.0
6 a 2 5.897658 2 0.1
7 a 1 12.43715 1 0.0
8 a 2 13.69162 2 0.1
9 a 1 12.87891 1 0.0
10 a 2 1.986482 2 0.1
Upvotes: 3
Reputation: 10365
I guess there are cleverer solutions out there that use vectorisation, but if you don't have super large data, I like to use a purrr::map
function for this together with a small custom made function:
library(dplyr)
set.seed(1)
imp_df <-
data.frame(exp=rep(letters[1:3], each=2),
rep=1:2,
mean=1:6,
sd=seq(0,0.5,0.1))
df <-
data.frame(
exp=rep(letters[1:3], each=20),
rep=1:2,
int=rnorm(60,10,5)
)
df[sample(1:60,25,replace=F), 'int'] <- NA
replace_fun <- function(x, mean, sd) {
if (is.na(x)) {
rnorm(1, mean, sd)
} else {
x
}
}
df %>%
left_join(imp_df, by = c("exp", "rep")) %>%
mutate(int = purrr::pmap_dbl(list(int, mean, sd), replace_fun)) %>%
head()
#> exp rep int mean sd
#> 1 a 1 1.000000 1 0.0
#> 2 a 2 10.918217 2 0.1
#> 3 a 1 5.821857 1 0.0
#> 4 a 2 17.976404 2 0.1
#> 5 a 1 11.647539 1 0.0
#> 6 a 2 5.897658 2 0.1
Created on 2021-05-27 by the reprex package (v0.3.0)
(If you want, you can remove the mean/sd columns with select(-c(mean, sd))
.)
Upvotes: 1