Dan Nguyen

Reputation: 9

How to remove subjects with outliers in two or more columns in R

Group ExamScore1 ExamScore2 ExamScore3 ExamScore4
A 68 84 19 95
B 68 83 28 92
B 68 92 38 83
C 78 84 38 94
C 94 85 28 82
C 94 92 38 38
B 48 83 83 38
B 38 19 48 29
C 29 23 91 12
A 48 34 92 39
A 95 58 93 48

Above is a data frame, df, derived from a larger data frame x, in which students are split into Group A, B, or C and each take four exams. I would like to do the following:

Identify which students have outlier test scores (using the interquartile range method) within Group A, Group B, and Group C separately. I have already written some code that more or less does this:

df1 <- df %>%
  group_by(x.Group) %>%
  filter(!x.score %in% boxplot.stats(x.score)$out) %>%
  ungroup()
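For reference, the interquartile range rule that boxplot.stats() applies can be written out by hand. The following is only a minimal sketch: is_iqr_outlier is a hypothetical helper name, the scores are made up, and it uses quantile() for the quartiles.

```r
# Hypothetical helper (not from the post): TRUE where a value lies outside
# [Q1 - coef * IQR, Q3 + coef * IQR].
is_iqr_outlier <- function(x, coef = 1.5) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  x < q[1] - coef * iqr | x > q[2] + coef * iqr
}

# Made-up scores for illustration only
scores <- c(68, 70, 72, 71, 69, 5)
flags <- is_iqr_outlier(scores)
scores[flags]
```

Note that boxplot.stats() computes its quartiles as hinges via fivenum() rather than with quantile(), so on small groups the two approaches may flag slightly different values.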

Then, I would like to remove students who had outlier scores in 2 or more exams. For example, if a student in Group A had an outlier score in both ExamScore1 and ExamScore3, that student would be removed from the data frame.

After all the outliers have been removed, I want the data put into a new data frame, df2.

Any thoughts on how to go about this? Thank you in advance.

Upvotes: 0

Views: 81

Answers (1)

Rui Barradas

Reputation: 76402

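Here is a way. Flag the outliers of each exam within each group, count the flags per student (row), bind that count to the original data set and keep only the rows with fewer than 2 outliers. In the end, remove the outliers column from the result df1. With the posted sample data no score is flagged, so all 11 rows are kept.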

df <- 'Group  ExamScore1  ExamScore2  ExamScore3  ExamScore4
A   68  84  19  95
B   68  83  28  92
B   68  92  38  83
C   78  84  38  94
C   94  85  28  82
C   94  92  38  38
B   48  83  83  38
B   38  19  48  29
C   29  23  91  12
A   48  34  92  39
A   95  58  93  48'
df <- read.table(textConnection(df), header = TRUE)

suppressPackageStartupMessages(
  library(dplyr)
)

df1 <- bind_cols(
  df,
  df %>%
    group_by(Group) %>%
    mutate(across(starts_with("ExamScore"), \(x) x %in% boxplot.stats(x)$out)) %>%
    ungroup() %>%
    rowwise() %>%
    mutate(outliers = sum(c_across(cols = starts_with("ExamScore")))) %>%
    select(outliers) 
) %>%
  filter(outliers < 2)

df1
#>    Group ExamScore1 ExamScore2 ExamScore3 ExamScore4 outliers
#> 1      A         68         84         19         95        0
#> 2      B         68         83         28         92        0
#> 3      B         68         92         38         83        0
#> 4      C         78         84         38         94        0
#> 5      C         94         85         28         82        0
#> 6      C         94         92         38         38        0
#> 7      B         48         83         83         38        0
#> 8      B         38         19         48         29        0
#> 9      C         29         23         91         12        0
#> 10     A         48         34         92         39        0
#> 11     A         95         58         93         48        0

df1 <- df1 %>% select(-outliers)

Created on 2022-10-23 with reprex v2.0.2
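Since no row of the sample data is actually dropped, here is a minimal base R sketch of the same counting logic (without dplyr) on made-up data where one student has outliers in two exams. The data frame demo and the helper flag_outliers are hypothetical and for illustration only. The result is stored in df2, as asked in the question.

```r
# Made-up data: student 1 has one outlier (ExamScore3),
# student 6 has two (ExamScore1 and ExamScore2).
demo <- data.frame(
  Group      = c(rep("A", 6), rep("B", 4)),
  ExamScore1 = c(70, 71, 72, 73, 74,  5, 40, 41, 42, 43),
  ExamScore2 = c(80, 81, 82, 83, 84,  3, 50, 51, 52, 53),
  ExamScore3 = c(10, 60, 61, 62, 63, 64, 60, 61, 62, 63)
)

# Hypothetical helper: flag outliers of x separately within each level of g
flag_outliers <- function(x, g) {
  out <- logical(length(x))
  for (idx in split(seq_along(x), g)) {
    out[idx] <- x[idx] %in% boxplot.stats(x[idx])$out
  }
  out
}

exam_cols <- grep("^ExamScore", names(demo), value = TRUE)

# Logical matrix: one row per student, one column per exam
flags <- sapply(demo[exam_cols], flag_outliers, g = demo$Group)

# Number of exams in which each student is an outlier
n_out <- rowSums(flags)

# Keep students with fewer than 2 outlier scores
df2 <- demo[n_out < 2, ]
df2
```

Here student 6 is removed, while student 1, with a single outlier, is kept. To apply this to the data in the question, replace demo with df.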

Upvotes: 2
