Reputation: 107
I have a large medical data frame that I want to use for machine learning, so I need to impute missing values. For the continuous variables I would like to substitute the median, like so:
dat$First_Wbc <- ifelse(is.na(dat$First_Wbc), median2(dat$First_Wbc), dat$First_Wbc)
I want to write code using mutate_at that does the same thing as the line above, but for multiple variables at once. I know it's possible, but so far I haven't managed to get the syntax right. Can you please help me?
Note: median2() is a function identical to median(), except that it ignores missing values.
Upvotes: 0
Views: 2151
Reputation: 3689
Speaking of tidy solutions, I really like the naniar package; it provides many useful methods for working with missing data.
For example, to impute medians in all numeric columns you could do:
library(tidyverse)
library(naniar)
df %>%
impute_median_if(is.numeric)
naniar also offers impute_median_all(), impute_mean_if(), and many useful missing-data visualizations.
Upvotes: 1
Reputation: 887163
We can use mutate_if with na.aggregate from zoo:
library(dplyr)
library(zoo)
df %>%
mutate_if(is.numeric, na.aggregate, FUN = median)
Upvotes: 1
Reputation: 388982
You can select columns by position:
library(dplyr)
df %>% mutate_at(2:4, ~replace(., is.na(.), median2(.)))
Or by a range of column names:
df %>% mutate_at(vars(a:d), ~replace(., is.na(.), median2(.)))
Or by a specific pattern in the column names:
df %>% mutate_at(vars(starts_with('col')), ~replace(., is.na(.), median2(.)))
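To see the range-based variant run end to end, here is a minimal sketch on a toy data frame. The column names and values are made up for illustration, and median2() is defined here only as an assumption based on the question's description (median() with missing values ignored).

```r
library(dplyr)

# Hypothetical helper, assumed from the question: median() that ignores NAs
median2 <- function(x) median(x, na.rm = TRUE)

# Toy data; names and values are illustrative only
df <- data.frame(
  id  = 1:5,
  a   = c(1, NA, 3, 4, 5),
  b   = c(NA, 10, 20, NA, 40),
  grp = c("x", "y", "x", "y", "x")
)

# Impute only columns a through b; id and grp are left untouched
imputed <- df %>%
  mutate_at(vars(a:b), ~ replace(., is.na(.), median2(.)))

print(imputed)
```

Selecting by name range rather than by position keeps the call stable if columns are later reordered or new ones are added in front.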
Upvotes: 2
Reputation: 5788
Base R solution:
# Impute each numeric column with the median of its observed values
num_cols <- sapply(dat, is.numeric)
dat[num_cols] <- lapply(dat[num_cols], function(x) {
  ifelse(is.na(x), median(x, na.rm = TRUE), x)
})
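If this needs to be applied to several data frames, the base R approach can be wrapped in a small function. The sketch below is one possible way to do that; the name impute_median and the toy columns other than First_Wbc are hypothetical.

```r
# Hypothetical helper: median-impute every numeric column of a data frame
impute_median <- function(data) {
  num_cols <- vapply(data, is.numeric, logical(1))
  data[num_cols] <- lapply(data[num_cols], function(x) {
    # Overwrite only the missing entries; observed values stay as they are
    x[is.na(x)] <- median(x, na.rm = TRUE)
    x
  })
  data
}

# Toy data; values are illustrative only
dat <- data.frame(
  First_Wbc = c(5.2, NA, 8.1, 6.4),
  Age       = c(61, 47, NA, 70),
  Sex       = c("M", NA, "F", "F"),
  stringsAsFactors = FALSE
)

dat_imputed <- impute_median(dat)
print(dat_imputed)
```

Non-numeric columns such as Sex are skipped, so their missing values remain and would need a separate strategy.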
Tidyverse using mutate_if:
library(tidyverse)
df %>%
  mutate_if(is.numeric, ~ replace(., is.na(.), median(., na.rm = TRUE)))
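In dplyr 1.0.0 and later, the scoped verbs (mutate_if, mutate_at) are superseded by across(), and funs() has been deprecated since dplyr 0.8.0. A rough equivalent of the call above, assuming a recent dplyr and a made-up toy data frame, might look like this:

```r
library(dplyr)

# Toy data; column names and values are illustrative only
df <- data.frame(
  wbc = c(4.1, NA, 7.3, 9.8),
  age = c(30L, 41L, NA, 55L),
  sex = c("f", "m", NA, "f"),
  stringsAsFactors = FALSE
)

# where(is.numeric) selects the numeric columns; the lambda fills NAs
# with the median of the observed values in that column
imputed <- df %>%
  mutate(across(
    where(is.numeric),
    ~ replace(.x, is.na(.x), median(.x, na.rm = TRUE))
  ))

print(imputed)
```

Note that integer columns may come back as doubles, because median() returns a double.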
Upvotes: 2