Detect outliers in grouped data.frame columns with varying group length

Question

I'm trying to average my measurement data of grain size classes of sediments in R. I usually have 8 measurements per sample, but sometimes the device failed or the measurements were unreliable (e.g. due to mechanical issues after 3 measurements etc). As a consequence, this measurements look very different from the rest of the same sample, so I figured an outlier test could remove them automatically.

I tried to follow this code using dplyr:

https://www.r-bloggers.com/combined-outlier-detection-with-dplyr-and-ruler/

but it seems to be a problem that not all groups have the same length. I found also this vector-targeted method

https://stackoverflow.com/a/4788102/8321705

but I don't know how to group my data beforehand and then apply it to each column of each group.

This is a head of my data, a few samples only have 3 repeated measurements. The first 3 numeric columns describe the particle fractions in one way and the last 6 columns in another way:

#my data with unequal group length
test <- structure(list(Sample_name = c("Sediment1140", "Sediment1140", 
"Sediment1140", "Sediment1140", "Sediment1140", "Sediment1140", 
"Sediment1140", "Sediment1140", "Sediment1141", "Sediment1141", 
"Sediment1141", "Sediment1141", "Sediment1141", "Sediment1141", 
"Sediment1141", "Sediment1141", "Sediment1142", "Sediment1142", 
"Sediment1142", "Sediment1142"), Dx_10_percent = c(228.3413627, 
232.9637155, 236.4058197, 235.4124387, 238.2260309, 238.983854, 
237.0509773, 234.22402, 245.5622443, 247.1072046, 248.7302949, 
247.7311716, 253.6328878, 249.7883614, 245.8217667, 247.047291, 
183.5981354, 186.4353531, 184.4024079, 183.9496282), Dx_50_percent = c(464.4559559, 
470.4392019, 479.0066087, 474.75933, 478.1515348, 481.8823096, 
480.3117339, 476.2827332, 442.6699831, 443.3890093, 442.4344575, 
435.0531805, 443.4543899, 447.494161, 434.7639443, 433.3472111, 
336.6085081, 340.9353695, 336.0106474, 340.0936298), Dx_90_percent = c(854.6392436, 
856.8504381, 879.5524457, 880.468129, 858.3297603, 905.0097741, 
879.5146896, 873.8584305, 819.4818726, 816.5296812, 778.9013718, 
766.7617116, 770.5702479, 829.0866972, 766.083991, 751.9915196, 
656.4105245, 664.7034131, 698.6157344, 718.225128), Microm_001 = c(0.797059348, 
0.801571015, 0.734207569, 0.841152063, 0.834553976, 0.75429789, 
0.831636299, 1.000633239, 0.713401217, 0.74354612, 0.741372753, 
0.841747801, 0.727424775, 0.755804532, 1.163420288, 1.081749441, 
0.27579194, 0.483475909, 0.555629788, 0.697398689), Microm_63 = c(0.90472056, 
0.944959738, 0.94659555, 0.903114644, 0.96501726, 0.91079578, 
0.954569594, 0.987258593, 0.000822487, 0.000571334, 0.000442701, 
0.000297749, 0, 0.000259136, 0.000891928, 0.000769421, 0.923900573, 
0.793744127, 0.809342888, 0.839719189), Microm_125 = c(11.30751247, 
10.58007103, 10.18149105, 10.27507954, 9.833963719, 9.901098909, 
9.983293752, 10.11735892, 10.0938523, 9.776308321, 9.483238809, 
9.57495155, 8.647488515, 9.280930949, 9.601526818, 9.458248991, 
26.14106339, 25.04179051, 25.88123946, 25.37747955), Microm_250 = c(42.63011079, 
42.40400703, 41.42517791, 41.94288617, 41.87864039, 41.23829484, 
41.31751454, 41.63495869, 49.13692679, 49.3879579, 50.23445206, 
51.60666791, 51.1727376, 49.11877239, 51.26674279, 52.00409162, 
50.48687475, 50.90777981, 50.23482884, 49.24764135), Microm_500 = c(39.88402392, 
40.77838926, 41.50718101, 40.65764848, 42.15199068, 40.83119895, 
41.73250567, 41.18885416, 35.87934642, 36.00915655, 37.3180857, 
35.62608032, 37.41951083, 36.26828346, 35.6623117, 35.71836133, 
20.18297131, 20.1836753, 17.33407897, 19.10041196), Microm_1000 = c(4.476572917, 
4.49100193, 5.205346918, 5.380119103, 4.335833973, 6.364313631, 
5.180480148, 5.070936398, 4.175650784, 4.082459775, 2.222407984, 
2.350254673, 2.032838283, 4.575949528, 2.305106476, 1.736779196, 
1.98939803, 2.589534343, 5.184880055, 4.737349268)), row.names = c(1L, 
2L, 3L, 4L, 5L, 6L, 7L, 8L, 10L, 11L, 12L, 13L, 14L, 15L, 16L, 
17L, 19L, 20L, 21L, 22L), class = "data.frame")

My pseudocode looks something like this:

#define function from SO answer
remove_outliers <- function(x, na.rm = TRUE, ...) {
  qnt <- quantile(x, probs=c(.25, .75), na.rm = na.rm, ...)
  H <- 1.5 * IQR(x, na.rm = na.rm)
  y <- x
  y[x < (qnt[1] - H)] <- NA
  y[x > (qnt[2] + H)] <- NA
  y
}

# using dplyr
out_test <- test %>%
  group_by(Sample_name) %>%
  apply(2, remove_outliers)

# using base R by
out_test2 <- by(test, test$Sample_name, remove_outliers)

How can I either mark/detect the rows which are significantly different from their parallels of the same sample or remove them directly?

Oh and a bonus question: are 8 samples sufficient to determine an outlier from a statistic's point of view? In my case it is about extreme cases resulted from failed measurements, but so no one else gets mislead.

A. Suliman · Accepted Answer

use mutate_at after excluding the grouping variable.

library(dplyr)
test %>%
     group_by(Sample_name) %>% 
     mutate_at(vars(-group_cols()), remove_outliers)

Detect outliers in grouped data.frame columns with varying group length

Answers (1)

Related Questions