Reputation: 37
I hope this finds you safe and well.
I have two complementary datasets: one contains the time series of concentration change values (Timeseries), and the other contains the average value of each time series (MeanConcentration).
I would like to identify outliers using a threshold of 3 Median Absolute Deviations (MADs) for each variable in the MeanConcentration dataset. First, I would like to find the ID and associated variable for each detected outlier, so that I can check by hand whether it is indeed an artifact that should be removed. Then I would like to create a function that removes these outliers.
I would then like to apply the same exclusion criteria to the time series data: if participant A is excluded for variable A in one dataset, that participant and variable should also be excluded in the other. For the time series data, I want to compute the Median Absolute Deviation on the average value from 5 to 9 seconds (to make it comparable to the MeanConcentration dataset). Note that the MADs will also have to be grouped by Chromophore, Condition, and ROI.
MeanConcentration <- data.frame(ID = c(1,2,3,4,5), Happy_HbO_LeftParietal_Value = c(0.239005609756098,
-0.812496292682927, -1.03227064146341, 0.469810975609756, -0.456419951219512
), Happy_HbO_RightParietal_Value = c(-1.97862195121951, -0.0803191658536585,
-0.0456078780487805, -0.29708887804878, 0.109126317073171), Happy_HbO_LeftSTC_Value = c(5.66059024390244,
-2.49184243902439, -0.876321414634146, 1.44561070731707, 0.0991754146341463
), Happy_HbO_RightSTC_Value = c(0.0138107658536585, 0.829429967804878,
-1.06818609756098, 0.636765365853659, -0.609962195121951), Happy_HbO_LeftDLPFC_Value = c(2.30691146341463,
0.749746341463415, 2.60103658536585, 0.870573414634146, -1.73371634146341
))
##TimeSeries Data Frame Example ##
ID time Condition Chromophore ROI Value
<chr> <dbl> <fct> <fct> <fct> <dbl>
1 1 -2 Happy HbO LeftParietal 0.848
2 1 -2 Happy HHb LeftParietal -0.243
3 1 -2 Happy HbO RightParietal 3.80
4 1 -2 Happy HHb RightParietal -0.289
5 1 -2 Happy HbO LeftSTC 2.15
6 1 -2 Happy HHb LeftSTC -1.26
Upvotes: 0
Views: 433
Reputation: 4344
I am not sure I fully understood your problem, but this should be pretty close to what you are looking for (just comment on what is wrong or missing and I will update the answer accordingly):
library(dplyr)
library(tidyr)
library(data.table) # to read in plain text as table
MeanConcentration <- data.frame(ID = c(1,2,3,4,5),
Happy_HbO_LeftParietal_Value = c(0.239005609756098, -0.812496292682927, -1.03227064146341, 0.469810975609756, -0.456419951219512),
Happy_HbO_RightParietal_Value = c(-1.97862195121951, -0.0803191658536585, -0.0456078780487805, -0.29708887804878, 0.109126317073171),
Happy_HbO_LeftSTC_Value = c(5.66059024390244, -2.49184243902439, -0.876321414634146, 1.44561070731707, 0.0991754146341463),
Happy_HbO_RightSTC_Value = c(0.0138107658536585, 0.829429967804878, -1.06818609756098, 0.636765365853659, -0.609962195121951),
Happy_HbO_LeftDLPFC_Value = c(2.30691146341463, 0.749746341463415, 2.60103658536585, 0.870573414634146, -1.73371634146341 ))
TS <- data.table::fread("ID time Condition Chromophore ROI Value
1 -2 Happy HbO LeftParietal 0.848
1 -2 Happy HHb LeftParietal -0.243
1 -2 Happy HbO RightParietal 3.80
1 -2 Happy HHb RightParietal -0.289
1 -2 Happy HbO LeftSTC 2.15
1 -2 Happy HHb LeftSTC -1.26")
OULIERS <- MeanConcentration %>%
  # convert the data to long format: column names become a "variable" column and all values go into one column
  tidyr::pivot_longer(cols = -ID, names_to = "variable", values_to = "values") %>%
  # split the variable names into their parts (you get a warning here because the split produces a 4th piece, the word "Value", which has no target column and is therefore dropped)
  tidyr::separate(variable, c("Condition", "Chromophore", "ROI"), sep = "_") %>%
  # group by the 3 parts of the variable (same as grouping per variable without splitting)
  dplyr::group_by(Condition, Chromophore, ROI) %>%
  # add columns for the median and MAD, then flag a value as an outlier if it lies outside median +- 3 MAD
  dplyr::mutate(MEDIAN = median(values, na.rm = TRUE),
                MAD = mad(values, na.rm = TRUE),
                OUTLIER = ifelse(values > MEDIAN + 3 * MAD | values < MEDIAN - 3 * MAD, "YES", "NO")) %>%
  # ungroup (not necessary but recommended)
  dplyr::ungroup() %>%
  # keep only the outliers
  dplyr::filter(OUTLIER == "YES")
# print the outliers for inspection
OULIERS
ID Condition Chromophore ROI values MEDIAN MAD OUTLIER
<dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl> <chr>
1 1 Happy HbO RightParietal -1.98 -0.0803 0.281 YES
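To see where that row comes from, and as a starting point for the "function" part of the question, here is a minimal sketch of the same rule in base R. The helper name is_mad_outlier and its argument k are my own (hypothetical, not from any package); note that stats::mad() multiplies the raw MAD by 1.4826 by default, which is why the MAD column above shows 0.281.

```r
# Hypothetical helper: TRUE where a value lies more than k MADs from the median.
# stats::mad() applies the default scaling constant 1.4826.
is_mad_outlier <- function(x, k = 3) {
  med <- median(x, na.rm = TRUE)
  dev <- mad(x, na.rm = TRUE)
  abs(x - med) > k * dev
}

# Happy_HbO_RightParietal_Value from the example data in the question
right_parietal <- c(-1.97862195121951, -0.0803191658536585,
                    -0.0456078780487805, -0.29708887804878, 0.109126317073171)

# Flags only the first value (ID 1), matching the table above
is_mad_outlier(right_parietal)

# One way to "remove" the outlier while keeping the row: set it to NA
replace(right_parietal, is_mad_outlier(right_parietal), NA)
```

If the MAD of a group is 0 (more than half of the values are identical), every value that differs from the median gets flagged, so small groups are worth checking by hand.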
# remove outliers by matching on the 3 columns; as written this drops every row of a flagged combination for all participants, so you possibly want to add "ID" to `by` as well
TS %>%
dplyr::anti_join(OULIERS, by = c("Condition", "Chromophore", "ROI"))
ID time Condition Chromophore ROI Value
1: 1 -2 Happy HbO LeftParietal 0.848
2: 1 -2 Happy HHb LeftParietal -0.243
3: 1 -2 Happy HHb RightParietal -0.289
4: 1 -2 Happy HbO LeftSTC 2.150
5: 1 -2 Happy HHb LeftSTC -1.260
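For the time series part of the question (MAD on the 5 to 9 second average, grouped by Condition, Chromophore and ROI, then excluding per participant), something along these lines might work. This is only a sketch: the data frame ts_toy and its values are made up for illustration, since the posted example only contains time = -2, and the names window_outliers and ts_clean are my own.

```r
library(dplyr)

# Made-up toy data (NOT the asker's data): 6 participants, times 0..10 s,
# one Condition/Chromophore/ROI combination. ID 6 is deliberately extreme.
ts_toy <- expand.grid(ID = as.character(1:6), time = 0:10,
                      stringsAsFactors = FALSE) %>%
  mutate(Condition = "Happy", Chromophore = "HbO", ROI = "LeftParietal",
         Value = ifelse(ID == "6", 10, as.numeric(ID) * 0.1))

window_outliers <- ts_toy %>%
  # keep only the 5 to 9 second window
  filter(time >= 5, time <= 9) %>%
  # one mean per participant and variable
  group_by(ID, Condition, Chromophore, ROI) %>%
  summarise(window_mean = mean(Value, na.rm = TRUE), .groups = "drop") %>%
  # median and MAD across participants within each variable
  group_by(Condition, Chromophore, ROI) %>%
  mutate(MEDIAN = median(window_mean, na.rm = TRUE),
         MAD = mad(window_mean, na.rm = TRUE)) %>%
  ungroup() %>%
  # same rule as above: outside median +- 3 MAD
  filter(abs(window_mean - MEDIAN) > 3 * MAD)

# Including ID in `by` removes only the flagged participant's rows
ts_clean <- ts_toy %>%
  anti_join(window_outliers, by = c("ID", "Condition", "Chromophore", "ROI"))
```

The same anti_join with "ID" added to by can be used with OULIERS from the MeanConcentration data, provided the ID columns in both data frames have compatible types (in your printed time series ID is character, while in MeanConcentration it is numeric).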
Upvotes: 1