Identifying the IDs for outliers and Removing Outliers Based on MAD for both the Summarized dataset and Averages of a Time Series

Question

I hope this finds you safe and well.

I have two complementary datasets, one which has the time series of concentration change values (Timeseries) and the other which has the average value of that time series (MeanConcentration).

I would like to identify outliers in each variable of the MeanConcentration dataset, defined as values more than 3 median absolute deviations (MADs) from the median. First, I would like to know the ID and the variable associated with each detected outlier, so that I can check by hand whether it is indeed an artifact that should be removed. Then I would like to write a function that removes these outliers.

I would then like to apply the same exclusion criteria to the time series data (so if participant A is excluded for variable A in dataset 1, I also want to exclude participant A for variable A in dataset 2). For the time series data, I want to assess the median absolute deviation based on the average value from 5 to 9 seconds (to make it comparable to the MeanConcentration dataset). Note that the MADs will also have to be grouped by Chromophore, Condition, and ROI.

MeanConcentration <- data.frame(ID = c(1,2,3,4,5), Happy_HbO_LeftParietal_Value = c(0.239005609756098, 
-0.812496292682927, -1.03227064146341, 0.469810975609756, -0.456419951219512
), Happy_HbO_RightParietal_Value = c(-1.97862195121951, -0.0803191658536585, 
-0.0456078780487805, -0.29708887804878, 0.109126317073171), Happy_HbO_LeftSTC_Value = c(5.66059024390244, 
-2.49184243902439, -0.876321414634146, 1.44561070731707, 0.0991754146341463
), Happy_HbO_RightSTC_Value = c(0.0138107658536585, 0.829429967804878, 
-1.06818609756098, 0.636765365853659, -0.609962195121951), Happy_HbO_LeftDLPFC_Value = c(2.30691146341463, 
0.749746341463415, 2.60103658536585, 0.870573414634146, -1.73371634146341
))

##TimeSeries Data Frame Example ##
ID  time  Condition  Chromophore  ROI             Value
1   -2    Happy      HbO          LeftParietal    0.848
1   -2    Happy      HHb          LeftParietal   -0.243
1   -2    Happy      HbO          RightParietal   3.80
1   -2    Happy      HHb          RightParietal  -0.289
1   -2    Happy      HbO          LeftSTC         2.15
1   -2    Happy      HHb          LeftSTC        -1.26

DPH · Accepted Answer

I am not sure I understood your problem fully, but this should be pretty close to what you are looking for (just comment on what is wrong or missing and I will update the answer accordingly):

library(dplyr)
library(tidyr)
library(data.table) # to read in plain text as table

MeanConcentration <- data.frame(ID = c(1,2,3,4,5), 
                                Happy_HbO_LeftParietal_Value = c(0.239005609756098, -0.812496292682927, -1.03227064146341, 0.469810975609756, -0.456419951219512),
                                Happy_HbO_RightParietal_Value = c(-1.97862195121951, -0.0803191658536585, -0.0456078780487805, -0.29708887804878, 0.109126317073171), 
                                Happy_HbO_LeftSTC_Value = c(5.66059024390244, -2.49184243902439, -0.876321414634146, 1.44561070731707, 0.0991754146341463), 
                                Happy_HbO_RightSTC_Value = c(0.0138107658536585, 0.829429967804878, -1.06818609756098, 0.636765365853659, -0.609962195121951),
                                Happy_HbO_LeftDLPFC_Value = c(2.30691146341463, 0.749746341463415, 2.60103658536585, 0.870573414634146, -1.73371634146341))


TS <- data.table::fread("ID       time Condition Chromophore ROI            Value
 1      -2 Happy     HbO         LeftParietal   0.848
 1      -2 Happy     HHb         LeftParietal  -0.243
 1      -2 Happy     HbO         RightParietal  3.80 
 1      -2 Happy     HHb         RightParietal -0.289
 1      -2 Happy     HbO         LeftSTC        2.15 
 1      -2 Happy     HHb         LeftSTC       -1.26")

OULIERS <- MeanConcentration %>% 
  # convert data so column names become variables and we have all variable values in one column
  tidyr::pivot_longer(cols = -ID, names_to = "variable", values_to = "values") %>% 
  # split the variable name into its parts (this raises a warning because the split produces a 4th piece, the word "Value", which has no name in `into` and is therefore dropped)
  tidyr::separate(variable, c("Condition", "Chromophore", "ROI"), sep = "_") %>% 
  # group by the 3 parts of the variable (same as grouping just per variable without splitting)
  dplyr::group_by(Condition, Chromophore, ROI) %>% 
  # add columns for the median and MAD, then flag a value as an outlier if it lies outside median +- 3 MAD
  dplyr::mutate(MEDIAN = median(values, na.rm = TRUE),
                MAD = mad(values, na.rm = TRUE),
                OUTLIER = ifelse(values > MEDIAN + 3 * MAD | values < MEDIAN - 3 * MAD, "YES", "NO")) %>% 
  # ungroup (not necessary but recommended)
  dplyr::ungroup() %>% 
  # get only the outliers
  dplyr::filter(OUTLIER == "YES")  

# print the outliers for inspection
OULIERS

     ID Condition Chromophore ROI           values  MEDIAN   MAD OUTLIER
                                
1     1 Happy     HbO         RightParietal  -1.98 -0.0803 0.281 YES 
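
One detail worth checking before relying on the 3 MAD cutoff: base R's `mad()` multiplies the raw median absolute deviation by a consistency constant (1.4826 by default), which is why the MAD column above shows 0.281 and not the raw value. The short sketch below reuses the RightParietal values from the question to compare both definitions; the variable names are illustrative only, and which definition is appropriate depends on your exclusion protocol.

```r
# Values of Happy_HbO_RightParietal_Value from the question
x <- c(-1.97862195121951, -0.0803191658536585, -0.0456078780487805,
       -0.29708887804878, 0.109126317073171)

# Default: raw MAD scaled by 1.4826 (comparable to a standard deviation
# under normality)
scaled_mad <- mad(x)

# Unscaled: plain median(|x - median(x)|)
raw_mad <- mad(x, constant = 1)

# The two differ only by the scaling constant
same_up_to_constant <- isTRUE(all.equal(scaled_mad, 1.4826 * raw_mad))

# Outlier flags under each definition (the unscaled cutoff is stricter)
flag_scaled <- abs(x - median(x)) > 3 * scaled_mad
flag_raw    <- abs(x - median(x)) > 3 * raw_mad

print(data.frame(x, flag_scaled, flag_raw))
```

For this variable both definitions flag only the first participant, but with the unscaled version the threshold is narrower, so the two can disagree on other data.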

# remove outliers by the combination of the 3 columns (note: this drops that combination for every ID; add "ID" to `by` to exclude only the flagged participant)
TS %>% 
  dplyr::anti_join(OULIERS, by = c("Condition", "Chromophore", "ROI"))

   ID time Condition Chromophore           ROI  Value
1:  1   -2     Happy         HbO  LeftParietal  0.848
2:  1   -2     Happy         HHb  LeftParietal -0.243
3:  1   -2     Happy         HHb RightParietal -0.289
4:  1   -2     Happy         HbO       LeftSTC  2.150
5:  1   -2     Happy         HHb       LeftSTC -1.260
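
The question also asks for the MAD check on the time series to be based on each participant's average from 5 to 9 seconds, and for exclusions to apply per participant. The following is a minimal sketch of one way to do that, not a tested solution for the real data: `ts_demo` is a made-up data set (the sample above only contains time -2), and the names `window_mean`, `window_outliers`, and `ts_clean` are hypothetical. The same `anti_join` with "ID" in `by` could equally be fed the outlier table derived from MeanConcentration.

```r
library(dplyr)

# Hypothetical long-format data: 5 participants, one Condition/Chromophore/ROI,
# one sample per second from 0 to 9 s. Participant 1 is deliberately extreme.
base_level <- c(-2.0, -0.08, -0.05, -0.30, 0.11)
ts_demo <- data.frame(
  ID          = rep(1:5, each = 10),
  time        = rep(0:9, times = 5),
  Condition   = "Happy",
  Chromophore = "HbO",
  ROI         = "RightParietal"
)
ts_demo$Value <- base_level[ts_demo$ID] + (ts_demo$time - 7) * 0.01

window_outliers <- ts_demo %>%
  # keep only the 5 to 9 s window
  dplyr::filter(time >= 5, time <= 9) %>%
  # one average per participant and variable
  dplyr::group_by(ID, Condition, Chromophore, ROI) %>%
  dplyr::summarise(window_mean = mean(Value, na.rm = TRUE), .groups = "drop") %>%
  # median and MAD across participants, within each variable
  dplyr::group_by(Condition, Chromophore, ROI) %>%
  dplyr::mutate(MEDIAN  = median(window_mean, na.rm = TRUE),
                MAD     = mad(window_mean, na.rm = TRUE),
                OUTLIER = abs(window_mean - MEDIAN) > 3 * MAD) %>%
  dplyr::ungroup() %>%
  dplyr::filter(OUTLIER)

# Inspect the flagged ID/variable combinations by hand before removing them
print(window_outliers)

# Including "ID" in `by` removes only the flagged participant's rows
ts_clean <- ts_demo %>%
  dplyr::anti_join(window_outliers,
                   by = c("ID", "Condition", "Chromophore", "ROI"))
```

In this toy data only participant 1 is flagged, so `ts_clean` keeps the full time course of the other four participants for that variable.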
