Reputation: 617
I have a data frame that contains multiple outliers. I suspect that these outliers have produced results different from what I expected.
I tried to use this tip but it didn't work as I still have very different values: https://www.r-bloggers.com/2020/01/how-to-remove-outliers-in-r/
I tried the solution with the rstatix package, but I can't remove the outliers from my data.frame:
library(rstatix)
library(dplyr)
df <- data.frame(
sample = 1:20,
score = c(rnorm(19, mean = 5, sd = 2), 50))
View(df)
out_df <- identify_outliers(df$score) # identify outliers
df2 <- df # copy df
df2 <- df2[-which(df2$score %in% out_df), ] # remove outliers from df2
View(df2)
Upvotes: 2
Views: 5667
Reputation: 325
A rule of thumb is that data points above Q3 + 1.5 * IQR or below Q1 - 1.5 * IQR are considered outliers, so you just have to identify them and remove them. I don't know how to do it with the rstatix dependency, but it can be achieved in base R as in the example below:
# Generate demo data
set.seed(123)
demo.data <- data.frame(
sample = 1:20,
score = c(rnorm(19, mean = 5, sd = 2), 50),
gender = rep(c("Male", "Female"), each = 10)
)
#identify outliers
outliers <- which(
  demo.data$score > quantile(demo.data$score)[4] + 1.5 * IQR(demo.data$score) |
    demo.data$score < quantile(demo.data$score)[2] - 1.5 * IQR(demo.data$score)
)
# remove them from your dataframe
df2 = demo.data[-outliers,]
A neater option is to write a function that returns the indices of the outliers:
get_outliers = function(x){
which(x > quantile(x)[4] + 1.5*IQR(x) | x < quantile(x)[2] - 1.5*IQR(x))
}
outliers <- get_outliers(demo.data$score)
df2 = demo.data[-outliers,]
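One caveat with negative indexing: if no point falls outside the fences, which() returns an empty vector and demo.data[-outliers,] then selects zero rows instead of all of them. A minimal sketch of a logical-mask variant that avoids this is below. The helper name outlier_mask and the small clean data frame are made up for illustration, and the sketch assumes the column is numeric with no missing values.

```r
# Hypothetical helper (not part of the answer above): TRUE where x lies
# outside the 1.5 * IQR fences. Assumes x is numeric with no missing values.
outlier_mask <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.75), names = FALSE)
  fence <- 1.5 * (q[2] - q[1])
  x < q[1] - fence | x > q[2] + fence
}

set.seed(123)
demo.data <- data.frame(
  sample = 1:20,
  score = c(rnorm(19, mean = 5, sd = 2), 50)
)

# Keep the rows that are not flagged
df2 <- demo.data[!outlier_mask(demo.data$score), ]

# Data with no outliers: the mask keeps every row, while negative
# indexing with an empty which() result keeps none
clean <- data.frame(score = c(1, 2, 3, 4, 5))
kept_mask  <- clean[!outlier_mask(clean$score), , drop = FALSE]
kept_which <- clean[-which(outlier_mask(clean$score)), , drop = FALSE]
```

Because the mask is a logical vector of the same length as the column, negating it is safe whether or not any outliers are present.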
Upvotes: 1
Reputation: 887741
identify_outliers expects a data.frame as input, i.e. its usage is
identify_outliers(data, ..., variable = NULL)
where
... - One unquoted expressions (or variable name). Used to select a variable of interest. Alternative to the argument variable.
df2 <- subset(df, !score %in% identify_outliers(df, "score")$score)
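To see what the call returns, here is a short sketch that assumes rstatix is installed and reuses the data frame from the question with a fixed seed (the seed is an addition, for reproducibility only). It relies on identify_outliers() returning the flagged rows together with the logical columns is.outlier and is.extreme, and it drops those rows by their sample id, which should give the same result as matching on score here.

```r
library(rstatix)

set.seed(123)
df <- data.frame(
  sample = 1:20,
  score = c(rnorm(19, mean = 5, sd = 2), 50)
)

# Rows flagged as outliers, with the original columns plus
# the logical columns is.outlier and is.extreme
out_df <- identify_outliers(df, score)

# Drop the flagged rows by matching on the sample id
df2 <- df[!df$sample %in% out_df$sample, ]

# Variation: drop only the points flagged as extreme
df3 <- df[!df$sample %in% out_df$sample[out_df$is.extreme], ]
```

The question's version failed because a bare vector (df$score) was passed where a data frame was expected, so there was no set of flagged rows to match against.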
Upvotes: 3