wesleysc352

Reputation: 617

How to identify and remove outliers in a data.frame using R?

I have a data frame that contains several outliers. I suspect that these outliers are the reason my results differ from what I expected.

I tried this tip, but it didn't work; I still get very different values: https://www.r-bloggers.com/2020/01/how-to-remove-outliers-in-r/

I also tried a solution with the rstatix package, but I can't remove the outliers from my data.frame:

library(rstatix)
library(dplyr)

df <- data.frame(
  sample = 1:20,
  score = c(rnorm(19, mean = 5, sd = 2), 50))

View(df)

out_df <- identify_outliers(df$score) # identify outliers

df2 <- df # copy df

df2 <- df2[-which(df2$score %in% out_df), ] # remove outliers from df2

View(df2)

Upvotes: 2

Views: 5667

Answers (2)

Alejo

Reputation: 325

A common rule of thumb is that data points above Q3 + 1.5 * IQR or below Q1 - 1.5 * IQR are considered outliers, so you just have to identify them and remove them. I don't know how to do it with rstatix, but it can be achieved in base R, as in the example below:

# Generate demo data
set.seed(123)
demo.data <- data.frame(
  sample = 1:20,
  score = c(rnorm(19, mean = 5, sd = 2), 50),
  gender = rep(c("Male", "Female"), each = 10)
)

# Identify outliers: row indices beyond the 1.5 * IQR fences
q <- quantile(demo.data$score, c(0.25, 0.75))  # Q1 and Q3
fence <- 1.5 * IQR(demo.data$score)
outliers <- which(demo.data$score > q[2] + fence | demo.data$score < q[1] - fence)

# Remove them from the data frame. The length check matters: with no
# outliers, demo.data[-integer(0), ] would return zero rows.
df2 <- if (length(outliers) > 0) demo.data[-outliers, ] else demo.data

Or, more neatly, write a function that returns the indices of the outliers:

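get_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75))  # Q1 and Q3
  fence <- 1.5 * IQR(x)
  which(x > q[2] + fence | x < q[1] - fence)
}

outliers <- get_outliers(demo.data$score)

# Guard against an empty index vector, which would otherwise drop every row
df2 <- if (length(outliers) > 0) demo.data[-outliers, ] else demo.data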
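
To see the rule in numbers, here is a minimal worked sketch on a small hand-made vector. The vector `x` and the names `q`, `fence`, `lower`, `upper` and `out_idx` are made up for illustration and are not part of the data above; the quartiles are those of R's default `quantile()` method.

```r
# Hypothetical toy data: seven ordinary values and one extreme value
x <- c(2, 4, 4, 5, 5, 6, 7, 50)

q <- quantile(x, c(0.25, 0.75))  # Q1 = 4, Q3 = 6.25 with the default method
fence <- 1.5 * IQR(x)            # IQR = 2.25, so fence = 3.375

lower <- q[1] - fence            # 0.625
upper <- q[2] + fence            # 9.625

# Indices of values outside [lower, upper]
out_idx <- which(x < lower | x > upper)
out_idx
# → 8 (the position of the value 50)
```

Only 50 falls outside the fences, so only that element would be removed.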

Upvotes: 1

akrun

Reputation: 887741

identify_outliers expects a data.frame as input, not a vector, i.e. its usage is

identify_outliers(data, ..., variable = NULL)

where

... - One unquoted expressions (or variable name). Used to select a variable of interest. Alternative to the argument variable.

df2 <- subset(df, !score %in% identify_outliers(df, "score")$score)
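
The filtering pattern itself can be tried without rstatix installed. This is only a sketch: `toy` is a made-up data frame, and `flagged` is a hand-written stand-in for the data frame of flagged rows (an assumption for illustration, not real output from identify_outliers).

```r
# Hypothetical toy data with one obvious extreme score
toy <- data.frame(sample = 1:5, score = c(4, 5, 6, 5, 50))

# Stand-in for the flagged rows; assumed here, not computed by rstatix
flagged <- data.frame(sample = 5, score = 50)

# Keep only the rows whose score is not among the flagged scores
toy2 <- subset(toy, !score %in% flagged$score)
nrow(toy2)
# → 4
```

Because the filter matches on values, every row that shares a flagged score is dropped, not just the first one.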

Upvotes: 3
