Reputation: 9
Group | ExamScore1 | ExamScore2 | ExamScore3 | ExamScore4 |
---|---|---|---|---|
A | 68 | 84 | 19 | 95 |
B | 68 | 83 | 28 | 92 |
B | 68 | 92 | 38 | 83 |
C | 78 | 84 | 38 | 94 |
C | 94 | 85 | 28 | 82 |
C | 94 | 92 | 38 | 38 |
B | 48 | 83 | 83 | 38 |
B | 38 | 19 | 48 | 29 |
C | 29 | 23 | 91 | 12 |
A | 48 | 34 | 92 | 39 |
A | 95 | 58 | 93 | 48 |
Above is a data frame, df, derived from a larger data frame x, in which students are split into Group A, B, or C and each take four exams. I would like to do the following:
Identify which students have outlier test scores (using the interquartile range method) within Group A, Group B, and Group C individually (I have already written some code that more or less does this).
df1 <- df %>%
group_by(x.Group) %>%
filter(!x.score %in% boxplot.stats(x.score)$out) %>%
ungroup()
Then, I would like to remove students who had outlier scores in 2 or more exams. For example, if a student in Group A had an outlier score in both ExamScore1 and ExamScore3, that student would be removed from the data frame.
After those students have been removed, I want the remaining data put into a new data frame, df2.
Any thoughts on how to go about this? Thank you in advance.
Upvotes: 0
Views: 81
Reputation: 76402
Here is a way. Flag the outliers within each group on each exam, count the flags per student, bind that count to the original data set, and keep only the rows with fewer than 2 outliers. In the end, remove the outliers column from the result df1.
df<-'Group ExamScore1 ExamScore2 ExamScore3 ExamScore4
A 68 84 19 95
B 68 83 28 92
B 68 92 38 83
C 78 84 38 94
C 94 85 28 82
C 94 92 38 38
B 48 83 83 38
B 38 19 48 29
C 29 23 91 12
A 48 34 92 39
A 95 58 93 48'
df <- read.table(textConnection(df), header = TRUE)
suppressPackageStartupMessages(
  library(dplyr)
)
df1 <- bind_cols(
  df,
  df %>%
    group_by(Group) %>%
    mutate(across(starts_with("ExamScore"), \(x) x %in% boxplot.stats(x)$out)) %>%
    ungroup() %>%
    rowwise() %>%
    mutate(outliers = sum(c_across(cols = starts_with("ExamScore")))) %>%
    select(outliers)
) %>%
  filter(outliers < 2)
df1
#> Group ExamScore1 ExamScore2 ExamScore3 ExamScore4 outliers
#> 1 A 68 84 19 95 0
#> 2 B 68 83 28 92 0
#> 3 B 68 92 38 83 0
#> 4 C 78 84 38 94 0
#> 5 C 94 85 28 82 0
#> 6 C 94 92 38 38 0
#> 7 B 48 83 83 38 0
#> 8 B 38 19 48 29 0
#> 9 C 29 23 91 12 0
#> 10 A 48 34 92 39 0
#> 11 A 95 58 93 48 0
df1 <- df1 %>% select(-outliers)
Created on 2022-10-23 with reprex v2.0.2
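In the sample above every outliers count is 0, so no row is actually dropped. Each group there has only three or four students. To see the filter remove someone, here is a minimal sketch on made-up data. The demo data frame, its values, and the is_outlier helper are assumptions for illustration only, not part of the original question. It also folds the count into a single grouped mutate() with rowSums(), which is just one possible variant of the approach above.

```r
suppressPackageStartupMessages(library(dplyr))

# Hypothetical data: two groups of six students, two exams each.
# Row 6 (Group A) is extreme on both exams; row 12 (Group B) on ExamScore1 only.
demo <- data.frame(
  Group      = rep(c("A", "B"), each = 6),
  ExamScore1 = c(50, 52, 51, 49, 50, 99, 70, 71, 69, 70, 72, 20),
  ExamScore2 = c(60, 61, 59, 60, 62, 5, 80, 81, 79, 80, 82, 81)
)

# Hypothetical helper: TRUE where a value lies beyond the boxplot whiskers
# (1.5 times the hinge spread, the default used by boxplot.stats()).
is_outlier <- function(x) x %in% boxplot.stats(x)$out

df2 <- demo %>%
  group_by(Group) %>%
  # Flag outliers per exam within each group, then count the flags per student.
  mutate(outliers = rowSums(across(starts_with("ExamScore"), is_outlier))) %>%
  ungroup() %>%
  # Keep students with fewer than 2 outlier scores, then drop the helper column.
  filter(outliers < 2) %>%
  select(-outliers)

df2
```

Here the Group A student with scores 99 and 5 is flagged on both exams and removed, while the Group B student with a score of 20 is flagged only once and stays, so df2 keeps 11 of the 12 rows.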
Upvotes: 2