Epic_Yarin_God

Reputation: 107

Imputing multiple columns in R using mutate_at

I have a large medical data frame that I want to use for machine learning, so I need to impute missing values. For the continuous variables I would like to use the median, like so:

dat$First_Wbc <- ifelse(is.na(dat$First_Wbc), median2(dat$First_Wbc), dat$First_Wbc)

I would like to write code using mutate_at that does the same thing as the line above, but for several variables at once. I know it's possible, but so far I haven't been able to get the syntax right. Can you please help me?

Note: median2() is a function identical to median(), except that it ignores missing values.
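The question does not show how median2() is defined. A minimal sketch, assuming it is nothing more than a thin wrapper around median() with na.rm = TRUE (the body below is a guess, not the asker's actual code):

```r
# Assumed definition of the asker's helper: median() that skips NAs
median2 <- function(x) median(x, na.rm = TRUE)

# Quick check on a made-up vector with one missing value
median2(c(1, NA, 3))
```

With a helper like this, median2(x) and median(x, na.rm = TRUE) are interchangeable in the answers below.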

Upvotes: 0

Views: 2151

Answers (4)

Giora Simchoni

Reputation: 3689

Speaking of tidy solutions, I really like the naniar package; it provides many useful methods for working with missing data.

E.g., here to impute medians in all numeric columns you could do:

library(tidyverse)
library(naniar)

df %>%
  impute_median_if(is.numeric)

Further value comes from impute_median_all(), impute_mean_if(), and many great missing-data visualizations.

Upvotes: 1

akrun

Reputation: 887163

We can use mutate_if with na.aggregate from the zoo package:

library(dplyr)
library(zoo)
df %>% 
   mutate_if(is.numeric, na.aggregate, FUN = median)

Upvotes: 1

Ronak Shah

Reputation: 388982

You can select columns by position:

library(dplyr)
df %>% mutate_at(2:4, ~replace(., is.na(.), median2(.)))

Or by a range of columns:

df %>% mutate_at(vars(a:d), ~replace(., is.na(.), median2(.)))

Or by a pattern in the column names:

df %>% mutate_at(vars(starts_with('col')), ~replace(., is.na(.), median2(.)))
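To see the selection styles side by side, here is a small self-contained sketch. The data frame, its column names, and the median2() definition are all made up for illustration; only mutate_at(), vars(), and replace() come from the answer itself.

```r
library(dplyr)

# Assumed stand-in for the asker's helper
median2 <- function(x) median(x, na.rm = TRUE)

# Toy data frame (hypothetical values)
df <- data.frame(
  id  = 1:5,
  wbc = c(1, NA, 3, 4, 5),
  hgb = c(NA, 2, 2, 8, NA),
  plt = c(10, 20, NA, 40, 50)
)

# Select by position: columns 2 to 4
by_position <- df %>%
  mutate_at(2:4, ~replace(., is.na(.), median2(.)))

# Select by a range of column names
by_range <- df %>%
  mutate_at(vars(wbc:plt), ~replace(., is.na(.), median2(.)))
```

Both calls should give the same result here, with id left untouched and each NA replaced by the median of the observed values in its own column.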

Upvotes: 2

hello_friend

Reputation: 5788

Base R solution:

num_cols <- sapply(dat, is.numeric)  # flag the numeric columns once
dat[num_cols] <- lapply(dat[num_cols], function(x) {
  # replace each NA with the median of the column's observed values
  replace(x, is.na(x), median(x, na.rm = TRUE))
})

Tidyverse using mutate_if:

library(tidyverse)
# replace NAs in every numeric column with that column's median
df %>% 
  mutate_if(is.numeric, ~replace(., is.na(.), median(., na.rm = TRUE)))
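With dplyr 1.0.0 or later, the same numeric-only imputation can also be sketched with across() and where(). The toy data below is made up for illustration, and this is just one way to write it, not necessarily what the answer's author would use:

```r
library(dplyr)

# Toy data (hypothetical): one character column, two numeric columns
df <- data.frame(
  sex = c("F", "M", NA, "F"),
  wbc = c(4.5, NA, 7.0, 6.0),
  hgb = c(NA, 13, 15, NA),
  stringsAsFactors = FALSE
)

# Impute medians in numeric columns only; other columns are left as they are
imputed <- df %>%
  mutate(across(where(is.numeric),
                ~replace(.x, is.na(.x), median(.x, na.rm = TRUE))))
```

Note that the missing value in the character column sex is not touched, since where(is.numeric) skips it.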

Upvotes: 2
