Reputation: 1
I have a large dataset and I’m trying to find the outliers of each variable in order to filter them out.
For a single variable in the dataset, I’d normally use:
> dataset$variable <- !dataset$variable %in% boxplot.stats(dataset$variable)$out
This, however, doesn't work for a large dataset with a variety of data types. My first attempt to overcome this was to use:
map(dataset, boxplot.stats)
which created a list of stats for each variable, but I could not extract the outliers from it.
Any suggestions on how to get around this and apply what I did for a single variable to the whole dataset?
Upvotes: 0
Views: 97
Reputation: 11046
You should provide reproducible data using dput() in the future. Since you did not, I will use the iris data set that is included with R. Rather than identify values to be removed we will identify the row number of the outliers:
data(iris)
idx <- sapply(iris[, -5], function(x) which(x %in% boxplot.stats(x)$out))
out <- sort(unique(unlist(unname(idx))))
out
# [1] 16 33 34 61
The last column of iris is the species name so we exclude it from the analysis. Then we identify the row numbers of the outliers in each column. Since you need to remove the entire row, not just the value, we can combine all of the row numbers, remove duplicates and sort the values. Now remove those rows from the data:
dim(iris) # The data set has 150 rows with 5 columns
# [1] 150 5
iris.mod <- iris[-out, ]
dim(iris.mod)
# [1] 146 5 # The modified data set has 146 rows with 5 columns.
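The question mentions a data set with a variety of data types, where excluding columns by position is impractical. As a minimal sketch, assuming the non-numeric columns should simply be skipped, the numeric columns can be picked with is.numeric(), and lapply() can be used so that the result is always a list (sapply() may simplify to a matrix when every column has the same number of outliers). The names num.cols, idx.num, out.num and iris.clean are illustrative only:

```r
data(iris)

# Logical vector marking the numeric columns (Species, a factor, is FALSE)
num.cols <- sapply(iris, is.numeric)

# Row numbers of the boxplot outliers in each numeric column;
# lapply() always returns a list, even if the lengths happen to match
idx.num <- lapply(iris[num.cols], function(x) which(x %in% boxplot.stats(x)$out))

# Combine, de-duplicate and sort the row numbers
out.num <- sort(unique(unlist(idx.num, use.names = FALSE)))

# Drop the rows only if there is something to drop:
# iris[-integer(0), ] would return zero rows instead of all of them
iris.clean <- if (length(out.num) > 0) iris[-out.num, ] else iris
dim(iris.clean)
```

The guard on length(out.num) matters for data sets in which no column has outliers, since negative indexing with an empty vector selects nothing.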
If you just want to replace the outliers with NA, that is also possible with a few adjustments to the code above.
parts <- which(sapply(idx, length) > 0)
rowcol <- lapply(parts, function(x) cbind(row=idx[[x]], col=x))
coords <- do.call(rbind, rowcol)
coords
# row col
# [1,] 16 2
# [2,] 33 2
# [3,] 34 2
# [4,] 61 2
iris.na <- iris
iris.na[coords] <- NA
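The same replacement can probably also be done column by column, without building a coordinate matrix. This is only a sketch, assuming again that only numeric columns should be checked; num.cols and iris.na2 are illustrative names:

```r
data(iris)

# Logical vector marking the numeric columns
num.cols <- sapply(iris, is.numeric)

iris.na2 <- iris
# Within each numeric column, replace the boxplot outliers with NA;
# non-numeric columns are left untouched
iris.na2[num.cols] <- lapply(iris.na2[num.cols], function(x) {
  replace(x, x %in% boxplot.stats(x)$out, NA)
})

# Number of values set to NA in each column
colSums(is.na(iris.na2))
```

As with the row-number approach, x %in% boxplot.stats(x)$out marks every row that holds one of the outlying values.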
Removing outliers is not always a good idea. There are other ways to deal with the problem such as using robust statistical methods.
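As one small illustration of the robust approach, base R's median() and mad() give estimates of location and spread that are much less affected by a few extreme values, so the rows can be kept. A sketch comparing them with mean() and sd() on the column that contained the outliers:

```r
data(iris)
x <- iris$Sepal.Width

# Classical estimates, sensitive to extreme values
classical <- c(mean = mean(x), sd = sd(x))

# Robust counterparts: median and median absolute deviation
# (mad() is scaled by default to be comparable to sd for normal data)
robust <- c(median = median(x), mad = mad(x))

classical
robust
```

Which robust method is appropriate depends on the analysis; this only shows the idea for simple summary statistics.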
Upvotes: 1