sahboor

Reputation: 73

Replace missing values with the mode for factor columns and the mean for numeric columns in R

I have the following data frame named "train". The columns bflag and zfactor are factors, and the columns vcount and vnumber are numeric. I want to replace the missing values in the factor columns with the mode and the missing values in the numeric columns with the mean, all within the same data frame. How can I do this in R?

ID   bflag  vcount zfactor vnumber
1     0       12      1       12
2     1       NA      0       8
3     0       3       0       9
4     1       13      0       NA
5     1       2       1       2
6     NA      10      NA      NA

Upvotes: 1

Views: 4625

Answers (2)

MKR

Reputation: 20095

dplyr::mutate_if applies a function only to the columns that satisfy a predicate, so it can choose the operation (mean or mode) based on each column's type. The solution is:

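library(dplyr)
# Define Mode() (given below) before running this pipeline.
# The ~ lambdas replace funs(), which is deprecated since dplyr 0.8.0.
df %>% mutate_if(is.numeric, ~ replace(., is.na(.), mean(., na.rm = TRUE))) %>%
  mutate_if(is.factor, ~ replace(., is.na(.), Mode(na.omit(.))))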

#   ID bflag vcount zfactor vnumber
# 1  1     0     12       1   12.00
# 2  2     1      8       0    8.00
# 3  3     0      3       0    9.00
# 4  4     1     13       0    7.75
# 5  5     1      2       1    2.00
# 6  6     1     10       0    7.75
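In dplyr 1.0.0 and later, the scoped verb mutate_if is superseded by across(). A minimal sketch of the same imputation in that style follows; it assumes dplyr >= 1.0.0 is installed, rebuilds the question's sample data under the name train, and reuses the Mode helper from this answer. It is one possible way to write it, not the only one.

```r
library(dplyr)

# Mode helper, same definition as used in this answer
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# Sample data from the question; colClasses makes bflag and zfactor factors
train <- read.table(text = "ID bflag vcount zfactor vnumber
1 0 12 1 12
2 1 NA 0 8
3 0 3 0 9
4 1 13 0 NA
5 1 2 1 2
6 NA 10 NA NA",
  header = TRUE,
  colClasses = c("numeric", "factor", "numeric", "factor", "numeric"))

# Numeric columns get the column mean, factor columns get the column mode
imputed <- train %>%
  mutate(
    across(where(is.numeric), ~ replace(.x, is.na(.x), mean(.x, na.rm = TRUE))),
    across(where(is.factor),  ~ replace(.x, is.na(.x), Mode(na.omit(.x))))
  )

print(imputed)
```

The result is stored in a new object (imputed) so the original train stays available for comparison; assign back to train to overwrite it.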

Note: The Mode function is taken from @RichScriven's answer; it comes from the question "Is there a built-in function for finding the mode?":

Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}
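Two properties of this Mode function are worth knowing before using it for imputation. The sketch below, using made-up input vectors, suggests that ties go to the value that appears first, and that NA is counted like any other value, which is why the pipeline above passes na.omit(.) to it.

```r
# Mode() as defined in the answer above
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# Ties: "a" and "b" both occur twice, and the first value encountered wins
tie <- Mode(c("a", "b", "b", "a"))

# NA counts as a value, so it can come back as the mode ...
with_na <- Mode(c(NA, NA, 1))

# ... unless the missing values are dropped first
without_na <- Mode(na.omit(c(NA, NA, 1, 1, 2)))

print(tie)
print(with_na)
print(without_na)
```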

Upvotes: 3

Rich Scriven

Reputation: 99361

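In base R you can iterate over the columns with lapply and pick the replacement with a simple if statement. We have to define a function for the statistical mode ourselves, since base R does not provide one (the built-in mode() returns an object's storage mode instead).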

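# df[-1] skips the ID column; Mode() and df are defined below
df[-1] <- lapply(df[-1], function(x) {
    # factor columns: fill NAs with the most frequent level
    if (is.factor(x)) replace(x, is.na(x), Mode(na.omit(x)))
    # numeric columns: fill NAs with the mean of the observed values
    else if (is.numeric(x)) replace(x, is.na(x), mean(x, na.rm = TRUE))
    else x
})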

df
#   ID bflag vcount zfactor vnumber
# 1  1     0     12       1   12.00
# 2  2     1      8       0    8.00
# 3  3     0      3       0    9.00
# 4  4     1     13       0    7.75
# 5  5     1      2       1    2.00
# 6  6     1     10       0    7.75

Data and Mode function:

df <- read.table(text = "ID   bflag  vcount zfactor vnumber
1     0       12      1       12
2     1       NA      0       8
3     0       3       0       9
4     1       13      0       NA
5     1       2       1       2
6     NA      10      NA      NA", 
colClasses = rep(c("numeric", "factor"), length.out=5), 
header = TRUE)

Mode <- function(x) {
    ux <- unique(x)
    ux[which.max(tabulate(match(x, ux)))]
}

Mode is borrowed from "Is there a built-in function for finding the mode?"
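If the same imputation is needed on several data frames, the lapply step can be wrapped in a function. The following is only a sketch: impute_na and its skip argument are hypothetical names introduced here, and the data frame is rebuilt as train to match the question.

```r
Mode <- function(x) {
    ux <- unique(x)
    ux[which.max(tabulate(match(x, ux)))]
}

# Hypothetical helper: returns a copy of `data` with NAs filled in,
# leaving the columns named in `skip` untouched
impute_na <- function(data, skip = character()) {
    cols <- setdiff(names(data), skip)
    data[cols] <- lapply(data[cols], function(x) {
        if (is.factor(x)) replace(x, is.na(x), Mode(na.omit(x)))
        else if (is.numeric(x)) replace(x, is.na(x), mean(x, na.rm = TRUE))
        else x
    })
    data
}

train <- data.frame(
    ID      = 1:6,
    bflag   = factor(c(0, 1, 0, 1, 1, NA)),
    vcount  = c(12, NA, 3, 13, 2, 10),
    zfactor = factor(c(1, 0, 0, 0, 1, NA)),
    vnumber = c(12, 8, 9, NA, 2, NA)
)

# Count missing values per column before and after imputation
na_before <- colSums(is.na(train))
filled    <- impute_na(train, skip = "ID")
na_after  <- colSums(is.na(filled))

print(na_before)
print(na_after)
```

Comparing the NA counts before and after is a quick way to confirm that every factor and numeric column was handled.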

Upvotes: 3
