Reputation: 608
I have a data frame like iris which I'll be splitting into train and test sets. Previously, I could use group_by and mutate from dplyr to replace each missing value with the mean of its species, for each attribute.
I now realise that correct ML practice requires splitting into training and test sets first, and then imputing the missing values in the test set using the means computed on the training set.
Can anyone help me do this?
If unclear, let's take an example with iris.
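library(dplyr)
iris0 <- iris
# Inject 50 NAs at random rows into each of the four numeric columns
iris0[sample(150,50,replace=FALSE),1] <- NA
iris0[sample(150,50,replace=FALSE),2] <- NA
iris0[sample(150,50,replace=FALSE),3] <- NA
iris0[sample(150,50,replace=FALSE),4] <- NA
# The seed is set here, so only the train/test split below is reproducible
set.seed(123)
sampled <- sample(150,50,replace=FALSE)
# 50 rows for testing, the remaining 100 for training
iris_test <- iris0[sampled,]
iris_train <- iris0[-sampled,]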
Now we come to imputation. I can get the means by class in iris_train:
iris_train %>% group_by(Species) %>% summarise_all(mean,na.rm=TRUE)
  Species    Sepal.Length Sepal.Width Petal.Length
1 setosa             5.07        3.49         1.5
2 versicolor         5.89        2.84         4.33
3 virginica          6.56        2.97         5.52
Now, as I said, filling in NAs in iris_train with these means can be done with dplyr (group_by and mutate). But I would like to replace the NAs in iris_test with the same training means, i.e. for setosa rows fill any NA in the Sepal.Length column with 5.07, for versicolor rows with 5.89, for virginica rows fill any NA in the Sepal.Width column with 2.97, etc.
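For reference, the training-set step mentioned above can be sketched roughly as follows. This is only a minimal sketch: it assumes dplyr 1.0 or later (for across()), rebuilds the toy data with its own seed so that it runs on its own, and the name iris_train_imputed is made up for illustration.

```r
library(dplyr)

# Toy data: iris with 50 NAs injected into each of the four numeric columns
set.seed(1)
iris0 <- iris
for (j in 1:4) iris0[sample(150, 50), j] <- NA

# Hold out 50 rows; the remaining 100 form the training set
sampled <- sample(150, 50)
iris_train <- iris0[-sampled, ]

# Within each species, replace every NA with that species' mean.
# The means are computed from the training rows only.
# across(everything()) skips the grouping column Species.
iris_train_imputed <- iris_train %>%
  group_by(Species) %>%
  mutate(across(everything(),
                ~ ifelse(is.na(.x), mean(.x, na.rm = TRUE), .x))) %>%
  ungroup()
```

This handles iris_train only; it does not carry the training means over to iris_test, which is the part I am stuck on.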
Only base-R and dplyr please!
Upvotes: 0
Views: 220
Reputation: 388982
We can store the per-species means in iris_mean, left_join it with iris_test, and replace the NA values.
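# Per-species means computed on the training set only
iris_mean <- iris_train %>% group_by(Species) %>% summarise_all(mean,na.rm=TRUE)
# Attach the training means to every test row; shared column names get
# the suffix .x (test values) and .y (training means)
temp <- iris_test %>% left_join(iris_mean, by = 'Species')
col1 <- 1:4  # test values
col2 <- 6:9  # matching training means
# Overwrite each NA in the test columns with the mean in the same row and attribute
temp[col1][is.na(temp[col1])] <- temp[col2][is.na(temp[col1])]
# Columns 1:5 hold the imputed test set (numeric columns keep the .x suffix)
temp[1:5]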
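If you would rather avoid the join and the .x/.y suffixes, the same idea can be written in base R with a lookup by species. This is only a sketch under a few assumptions: the example data is rebuilt here with the seed set first (so the NA positions differ from the question's), and train_means, idx and iris_test_imputed are names chosen for illustration.

```r
# Rebuild the example data; the seed comes first so the NA positions are reproducible too
set.seed(123)
iris0 <- iris
for (j in 1:4) iris0[sample(150, 50), j] <- NA
sampled <- sample(150, 50)
iris_test <- iris0[sampled, ]
iris_train <- iris0[-sampled, ]

num_cols <- names(iris)[1:4]

# Per-species means from the training set only (na.rm is passed on to mean)
train_means <- aggregate(iris_train[num_cols],
                         by = list(Species = iris_train$Species),
                         FUN = mean, na.rm = TRUE)

# For every test row, the row of train_means that holds its species
idx <- match(iris_test$Species, train_means$Species)

# Fill each NA in the test set with the training mean of the same species and column
iris_test_imputed <- iris_test
for (col in num_cols) {
  missing <- is.na(iris_test_imputed[[col]])
  iris_test_imputed[[col]][missing] <- train_means[[col]][idx[missing]]
}
```

Here iris_test_imputed keeps the original column names and row order, and the observed (non-NA) test values are left untouched.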
Upvotes: 0