Reputation: 73
I have the following data frame named "train". The columns bflag and zfactor are factors, and the other two columns (vcount and vnumber) are numeric. I want to replace missing values in the factor columns with the mode and missing values in the numeric columns with the mean, in the same data frame. How can I do this in R?
ID bflag vcount zfactor vnumber
1 0 12 1 12
2 1 NA 0 8
3 0 3 0 9
4 1 13 0 NA
5 1 2 1 2
6 NA 10 NA NA
Upvotes: 1
Views: 4625
Reputation: 20095
dplyr::mutate_if lets you select columns by type and apply the matching operation (mode or mean) to each of them. The solution is:
library(dplyr)
# Mode() is the helper defined in the note at the end of this answer
df %>%
  mutate_if(is.numeric, ~ replace(., is.na(.), mean(., na.rm = TRUE))) %>%
  mutate_if(is.factor, ~ replace(., is.na(.), Mode(na.omit(.))))
# ID bflag vcount zfactor vnumber
# 1 1 0 12 1 12.00
# 2 2 1 8 0 8.00
# 3 3 0 3 0 9.00
# 4 4 1 13 0 7.75
# 5 5 1 2 1 2.00
# 6 6 1 10 0 7.75
Note: The Mode function is taken from @RichScriven's answer. Its original source is the question "Is there a built-in function for finding the mode?".
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}
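For newer dplyr versions, the same idea can be sketched with across() and where() instead of mutate_if. This assumes dplyr 1.0.0 or later; the result name imputed is only illustrative, and the data is the sample from the question.

```r
library(dplyr)

# Mode helper, same definition as used in this answer
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# Sample data from the question; bflag and zfactor are read as factors
df <- read.table(text = "ID bflag vcount zfactor vnumber
1 0 12 1 12
2 1 NA 0 8
3 0 3 0 9
4 1 13 0 NA
5 1 2 1 2
6 NA 10 NA NA",
  colClasses = c("numeric", "factor", "numeric", "factor", "numeric"),
  header = TRUE)

# Numeric columns get the column mean, factor columns get the most frequent level
imputed <- df %>%
  mutate(
    across(where(is.numeric), ~ replace(.x, is.na(.x), mean(.x, na.rm = TRUE))),
    across(where(is.factor), ~ replace(.x, is.na(.x), Mode(na.omit(.x))))
  )
```

The two across() calls play the same role as the two mutate_if steps, one per column type.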
Upvotes: 3
Reputation: 99361
In base R you can iterate over the columns and use a simple if statement. We have to define a function for the mode ourselves, since base R does not provide one (the built-in mode() returns an object's storage mode, not the most frequent value).
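# df[-1] skips the first column (ID), which should not be imputed
df[-1] <- lapply(df[-1], function(x) {
  if (is.factor(x)) replace(x, is.na(x), Mode(na.omit(x)))        # factor: mode
  else if (is.numeric(x)) replace(x, is.na(x), mean(x, na.rm = TRUE))  # numeric: mean
  else x                                                           # other types: unchanged
})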
df
# ID bflag vcount zfactor vnumber
# 1 1 0 12 1 12.00
# 2 2 1 8 0 8.00
# 3 3 0 3 0 9.00
# 4 4 1 13 0 7.75
# 5 5 1 2 1 2.00
# 6 6 1 10 0 7.75
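Since the question's data frame is called train, the same loop could be wrapped in a small function and applied to it directly. This is only a sketch: impute_columns is a hypothetical name, not a base R function, and unlike the df[-1] version it walks every column, which is harmless here because ID has no missing values.

```r
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# impute_columns() is a hypothetical helper name, not part of base R
impute_columns <- function(data) {
  # data[] keeps the data.frame structure while replacing each column
  data[] <- lapply(data, function(x) {
    if (is.factor(x)) replace(x, is.na(x), Mode(na.omit(x)))
    else if (is.numeric(x)) replace(x, is.na(x), mean(x, na.rm = TRUE))
    else x
  })
  data
}

# Sample data from the question, built by hand
train <- data.frame(
  ID      = c(1, 2, 3, 4, 5, 6),
  bflag   = factor(c(0, 1, 0, 1, 1, NA)),
  vcount  = c(12, NA, 3, 13, 2, 10),
  zfactor = factor(c(1, 0, 0, 0, 1, NA)),
  vnumber = c(12, 8, 9, NA, 2, NA)
)

# Overwrite train so the imputation happens "in the same data frame"
train <- impute_columns(train)
```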
Data and Mode function:
df <- read.table(text = "ID bflag vcount zfactor vnumber
1 0 12 1 12
2 1 NA 0 8
3 0 3 0 9
4 1 13 0 NA
5 1 2 1 2
6 NA 10 NA NA",
colClasses = rep(c("numeric", "factor"), length.out=5),
header = TRUE)
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}
Mode is borrowed from "Is there a built-in function for finding the mode?".
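As a small illustration of why the code passes na.omit(x) to Mode rather than x itself: this Mode counts NA like any other value, so NA can win if it is the most frequent entry. The sketch below also shows what happens on ties. The variable names are only for the example.

```r
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

x <- c(NA, NA, 1)

# NA appears twice, so it is returned as the "mode"
with_na <- Mode(x)

# Dropping NAs first gives the most frequent observed value
without_na <- Mode(na.omit(x))

# On a tie, which.max picks the first maximum, so the value
# that appears first in the input is returned
tie <- Mode(c("b", "a", "a", "b"))
```

Here with_na is NA, without_na is 1, and tie is "b".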
Upvotes: 3