Reputation: 101
I have a dataset with 498 variables of various kinds (numeric, logical, date and others), stored as a data frame in R with rows for observations and columns for variables. For a certain subset of these variables, I would like to replace the missing values with the mean of that variable.
I have coded this very simple function for mean imputation:
impute.mean <- function(x) replace(x, is.na(x), mean(x, na.rm = TRUE))
This works beautifully when I apply it to an individual variable, say dataset$variableA:
dataset$variableA <- impute.mean(dataset$variableA)
That gives me exactly what I want for one variable, but I need to do this for a fairly large subset of variables and would rather not go through each one manually.
My first instinct was to use one of R's apply functions to do this efficiently, but I don't understand exactly how.
A rough first attempt was to use the standard apply:
newdataset <- apply(dataset, 2, impute.mean)
This is obviously a bit crude, since it applies the function to all columns, including the non-numeric ones, but it seemed like a reasonable starting place even if it might generate a number of warnings. Alas, this method did not work: all my variables remained the same.
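One likely explanation, sketched on a toy data frame (the names df, a and b are made up for illustration): apply() converts the data frame to a matrix first, and with mixed column types that matrix is character, so mean() returns NA and nothing is actually imputed.

```r
# Toy data frame (hypothetical names): one numeric and one character column
impute.mean <- function(x) replace(x, is.na(x), mean(x, na.rm = TRUE))
df <- data.frame(a = c(1, NA, 3), b = c("x", "y", "z"),
                 stringsAsFactors = FALSE)

# apply() first converts df with as.matrix(); the mixed types give a
# character matrix, so mean() warns and returns NA for every column
result <- suppressWarnings(apply(df, 2, impute.mean))

is.character(result)   # the numeric column has been turned into text
is.na(result[2, "a"])  # the missing value is still missing
```

If this is what is happening, the result is also a matrix rather than a data frame, which is another reason apply() is a poor fit here.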
I have also experimented with lapply, mapply and ddply, but without any success so far.
Ideally, I would like to be able to do something like this:
relevantVariables <- c("variableA1", "variableA2", ..., "variableA293")
newdataset <- magical.apply(dataset, relevantVariables, impute.mean)
Is there some apply function that works in this manner?
Alternatively, is there some other efficient way of going about this?
Upvotes: 4
Views: 1587
Reputation: 15793
You can do this efficiently with the data.table package:
SetNAsToMean <- function(dt, vars) {
  # Sets NA values of columns to the column means
  #
  # Args:
  #   dt: data.table object to work with
  #   vars: vector of column names to replace NAs
  #
  # Returns:
  #   Nothing. Alters data.table in place.
  #
  # Example:
  #   dt <- data.table(num1 = c(1, NA, 3),
  #                    num2 = c(NA, NA, 4),
  #                    char1 = rep("a", 3))
  #   SetNAsToMean(dt, c("num1", "num2"))
  #   # Alternatively, set all numeric columns
  #   numerics <- which(lapply(dt, class) == "numeric")
  #   SetNAsToMean(dt, numerics)
  require(data.table)
  for (var in vars) {
    set(dt, which(is.na(dt[[var]])), var, mean(dt[[var]], na.rm = TRUE))
  }
}
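As a quick check of the idea, here is a self-contained sketch that repeats a compact version of the function and runs it on the toy table from the comment above. It assumes the data.table package is installed; the is.numeric-based column selection is my own variation, not part of the function itself.

```r
library(data.table)

# Compact copy of the function above, so this snippet runs on its own
SetNAsToMean <- function(dt, vars) {
  for (var in vars) {
    set(dt, i = which(is.na(dt[[var]])), j = var,
        value = mean(dt[[var]], na.rm = TRUE))
  }
}

dt <- data.table(num1 = c(1, NA, 3),
                 num2 = c(NA, NA, 4),
                 char1 = rep("a", 3))

# Pick the numeric columns by name; note that is.numeric() would also
# select integer columns, which this toy table does not contain
numerics <- names(dt)[vapply(dt, is.numeric, logical(1))]

SetNAsToMean(dt, numerics)  # modifies dt by reference, no assignment needed
print(dt)
```

Because set() updates by reference, the original table is changed without making a copy, which is where the efficiency comes from on wide tables.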
Upvotes: 1
Reputation: 955
Would this work for you?
for (j in seq_along(dataset)) {
  if (is.numeric(dataset[[j]])) {
    # Compute the column mean once, ignoring missing values
    colMean <- mean(dataset[[j]], na.rm = TRUE)
    for (k in seq_len(nrow(dataset))) {
      if (is.na(dataset[k, j])) {
        dataset[k, j] <- colMean
      }
    }
  }
}
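If only a named subset of columns should be imputed, as in the question, a vectorised sketch using lapply could look like the following. The toy data and the column name label are made up for illustration; impute.mean and relevantVariables are taken from the question.

```r
# impute.mean as defined in the question
impute.mean <- function(x) replace(x, is.na(x), mean(x, na.rm = TRUE))

# Toy stand-in for the real dataset (hypothetical values)
dataset <- data.frame(variableA1 = c(1, NA, 3),
                      variableA2 = c(NA, 10, 20),
                      label = c("a", NA, "c"),
                      stringsAsFactors = FALSE)

relevantVariables <- c("variableA1", "variableA2")

# lapply() works column by column and keeps each column's type;
# assigning back into dataset[relevantVariables] leaves other columns alone
dataset[relevantVariables] <- lapply(dataset[relevantVariables], impute.mean)

print(dataset)
```

Unlike apply, this never converts the data frame to a matrix, so the non-numeric columns (and their missing values) stay untouched.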
Upvotes: 0