Replacing missing values with mean from another dataset, by factor

Question

I have a data frame like iris that I'll be splitting into training and test sets. Previously, I used group_by and mutate from dplyr to replace each missing value with the mean of its species for that attribute, for every attribute.

I now realise that correct ML practice is to split into training and test sets first, and then impute the missing values in the test set using the training means.

Can anyone help me do this?

In case that is unclear, here is an example with iris.

library(dplyr)

# Seed before any sampling so the injected NAs and the split are both reproducible
set.seed(123)
iris0 <- iris

iris0[sample(150,50,replace=FALSE),1] <- NA
iris0[sample(150,50,replace=FALSE),2] <- NA
iris0[sample(150,50,replace=FALSE),3] <- NA
iris0[sample(150,50,replace=FALSE),4] <- NA

sampled <- sample(150,50,replace=FALSE)
iris_test <- iris0[sampled,]
iris_train <- iris0[-sampled,]

Now we come to imputation. I can get the means by class in iris_train:

iris_train %>% group_by(Species) %>% summarise_all(mean,na.rm=TRUE)

  Species    Sepal.Length Sepal.Width Petal.Length
1 setosa             5.07        3.49         1.5
2 versicolor         5.89        2.84         4.33
3 virginica          6.56        2.97         5.52

As I said, filling in the NAs in iris_train with these means can be done with dplyr (group_by and mutate). But I would like to replace the NAs in iris_test with the same training means: for example, any NA in the Sepal.Length column is filled with 5.07 for setosa and 5.89 for versicolor, any NA in the Sepal.Width column is filled with 2.97 for virginica, and so on.

Only base R and dplyr, please!

Ronak Shah · Accepted Answer

We can store the per-species training means in iris_mean, left_join it with iris_test on Species, and then replace the NA values with the joined means.

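library(dplyr)

# Per-species means, computed from the training set only
iris_mean <- iris_train %>% group_by(Species) %>% summarise_all(mean, na.rm = TRUE)
# Append the training means to each test row; the joined mean columns get a .y suffix
temp <- iris_test %>% left_join(iris_mean, by = 'Species')
col1 <- 1:4  # original measurements (suffix .x after the join)
col2 <- 6:9  # matching training means (suffix .y)
# Overwrite only the NA cells with the mean from the corresponding column
temp[col1][is.na(temp[col1])] <- temp[col2][is.na(temp[col1])]
temp[1:5]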
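
The approach above relies on column positions (1:4 and 6:9) and leaves the measurement columns with a .x suffix. As a rough sketch of a name-based variant using only base R and dplyr, the training means can be looked up per row with match and filled in with coalesce. The names num_cols, idx and iris_test_imputed are introduced here for illustration and are not from the original post; the data setup is repeated only so the block runs on its own.

```r
library(dplyr)

# Toy data as in the question: iris with NAs injected into the four numeric columns.
set.seed(123)
iris0 <- iris
for (j in 1:4) iris0[sample(150, 50), j] <- NA

sampled    <- sample(150, 50)
iris_test  <- iris0[sampled, ]
iris_train <- iris0[-sampled, ]

# Per-species means computed on the training set only.
iris_mean <- iris_train %>%
  group_by(Species) %>%
  summarise_all(mean, na.rm = TRUE)

# Numeric columns to impute (hypothetical helper name).
num_cols <- names(iris_test)[sapply(iris_test, is.numeric)]

# For each test row, the row of iris_mean that holds its species.
idx <- match(iris_test$Species, iris_mean$Species)

# Keep observed values; fill NAs with the training mean of the same species.
iris_test_imputed <- iris_test
for (col in num_cols) {
  iris_test_imputed[[col]] <- coalesce(iris_test[[col]], iris_mean[[col]][idx])
}

head(iris_test_imputed)
```

This keeps the original column names and does not depend on where the joined columns end up. It assumes every species in iris_test also appears in iris_train with at least one non-missing value per column; otherwise the filled-in value would itself be NA or NaN.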
