azizi tamimi

Reputation: 75

How to remove outliers based on standard deviation using tidyverse?

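I tried the following code, using the tidyverse package, to compute per-group cutoffs for filtering outliers based on the standard deviation (2.5 SD above and below the group mean).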

rt_trimmed_data_Dec = data_Dec %>%
  group_by(Time_of_Testing, Item_Type, Group) %>%
  summarise(RT_mean = mean(RT, na.rm = TRUE), RT_sd = sd(RT, na.rm = TRUE)) %>%
  ungroup() %>%
  mutate(rt_high = RT_mean + (2.5 * RT_sd)) %>%
  mutate(rt_low = RT_mean - (2.5 * RT_sd))

Then I tried to join the two data frames so that I could filter out the rows falling outside those cutoffs.

data_Dec_RT = data_Dec %>%
  inner_join(rt_trimmed_data_Dec) %>%
  filter(RT < rt_high) %>%
  filter(RT > rt_low)

But then I got this error:

Error: `by` required, because the data sources have no common variables

Call rlang::last_error() to see a backtrace.
> rlang::last_error()
message: by required, because the data sources have no common variables
class: rlang_error
backtrace:
  1. dplyr::inner_join(., rt_trimmed_data_Dec)
  9. dplyr:::common_by.NULL(by, x, y)
 11. dplyr:::bad_args("by", "required, because the data sources have no common variables")
 12. dplyr:::glubort(fmt_args(args), ..., .envir = .envir)
 13. dplyr::inner_join(., rt_trimmed_data_Dec)

Could you please advise on how to solve this issue? I would highly appreciate your help.

Upvotes: 1

Views: 4799

Answers (2)

Tom Beesley

Reputation: 85

This is pretty easy to do by z-scoring your RT column within each group using scale(), then keeping only the rows whose z-score lies between -2.5 and 2.5.

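    library(tidyverse)

    samples <- 50  # RT observations per participant
    Ps <- 10       # number of participants

    # data frame that contains participant numbers and simulated RT scores
    data <- data.frame(participant = as.factor(rep(1:Ps, each = samples)),
                       RT = rnorm(n = samples * Ps, mean = 600, sd = 50))

    # z-score RT within each participant, then keep rows within 2.5 SD of the mean
    data_noOutliers <- data %>%
      group_by(participant) %>%
      mutate(zRT = as.numeric(scale(RT))) %>%  # scale() returns a one-column matrix
      filter(between(zRT, -2.5, 2.5)) %>%
      ungroup()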
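
To apply the same idea to the data in the question, you would group by the three columns used there instead of by participant. Here is a minimal sketch; the toy data_Dec below is made up for illustration (the factor levels, the trial column, and the planted value of 5000 are all assumptions), so substitute your real data frame.

```r
library(dplyr)

set.seed(1)

# Hypothetical stand-in for data_Dec: 8 groups x 40 trials (all values invented)
data_Dec <- expand.grid(
  Time_of_Testing = c("pre", "post"),
  Item_Type = c("word", "nonword"),
  Group = c("A", "B"),
  trial = 1:40
)
data_Dec$RT <- rnorm(nrow(data_Dec), mean = 600, sd = 50)
data_Dec$RT[1] <- 5000  # plant one obvious outlier

# z-score RT within each group and keep rows strictly within 2.5 SD
data_Dec_RT <- data_Dec %>%
  group_by(Time_of_Testing, Item_Type, Group) %>%
  mutate(zRT = as.numeric(scale(RT))) %>%
  filter(abs(zRT) < 2.5) %>%
  ungroup()
```

Note that filter() drops rows where the condition is NA, so any rows with a missing RT are removed as well.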

Upvotes: 3

Ronak Shah

Reputation: 389135

I think you can do this without the intermediate summary and join, by filtering within each group directly:

library(dplyr)
data_Dec %>%
  group_by(Time_of_Testing, Item_Type, Group) %>%
  filter(between(RT, mean(RT, na.rm=TRUE) - (2.5 * sd(RT, na.rm=TRUE)), 
                     mean(RT, na.rm=TRUE) + (2.5 * sd(RT, na.rm=TRUE))))
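
If you would rather keep the two-step approach from the question, the sketch below shows it with the join columns named explicitly. The error message says the two data frames share no columns, which should not happen when dplyr's summarise() runs on grouped data, because it keeps the grouping columns. One possible cause, which is an assumption and not something the question confirms, is that plyr was loaded after dplyr, so plyr::summarise() was called instead and returned only RT_mean and RT_sd. Calling dplyr::summarise() explicitly rules that out. The toy data_Dec and the name rt_limits are made up for illustration.

```r
library(dplyr)

set.seed(1)

# Hypothetical stand-in for data_Dec: 8 groups x 40 trials (all values invented)
data_Dec <- expand.grid(
  Time_of_Testing = c("pre", "post"),
  Item_Type = c("word", "nonword"),
  Group = c("A", "B"),
  trial = 1:40
)
data_Dec$RT <- rnorm(nrow(data_Dec), mean = 600, sd = 50)
data_Dec$RT[1] <- 5000  # plant one obvious outlier

# Per-group cutoffs; dplyr:: prefix guards against masking by another package
rt_limits <- data_Dec %>%
  group_by(Time_of_Testing, Item_Type, Group) %>%
  dplyr::summarise(RT_mean = mean(RT, na.rm = TRUE),
                   RT_sd = sd(RT, na.rm = TRUE),
                   .groups = "drop") %>%
  mutate(rt_low = RT_mean - 2.5 * RT_sd,
         rt_high = RT_mean + 2.5 * RT_sd)

# Join on the grouping columns by name, then keep rows inside the cutoffs
data_Dec_RT <- data_Dec %>%
  inner_join(rt_limits, by = c("Time_of_Testing", "Item_Type", "Group")) %>%
  filter(RT > rt_low, RT < rt_high)
```

Checking names(rt_limits) before the join is a quick way to confirm that the grouping columns survived the summary step.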

Upvotes: 1
