Reputation: 623
I'm trying to average my measurement data of grain size classes of sediments in R. I usually have 8 measurements per sample, but sometimes the device failed or the measurements were unreliable (e.g. due to mechanical issues after 3 measurements etc). As a consequence, this measurements look very different from the rest of the same sample, so I figured an outlier test could remove them automatically.
I tried to follow this code using dplyr
:
https://www.r-bloggers.com/combined-outlier-detection-with-dplyr-and-ruler/
but it seems to be a problem that not all groups have the same length. I found also this vector-targeted method
https://stackoverflow.com/a/4788102/8321705
but I don't know how to group my data beforehand and then apply it to each column of each group.
This is a head of my data, a few samples only have 3 repeated measurements. The first 3 numeric columns describe the particle fractions in one way and the last 6 columns in another way:
#my data with unequal group length
test <- structure(list(Sample_name = c("Sediment1140", "Sediment1140",
"Sediment1140", "Sediment1140", "Sediment1140", "Sediment1140",
"Sediment1140", "Sediment1140", "Sediment1141", "Sediment1141",
"Sediment1141", "Sediment1141", "Sediment1141", "Sediment1141",
"Sediment1141", "Sediment1141", "Sediment1142", "Sediment1142",
"Sediment1142", "Sediment1142"), Dx_10_percent = c(228.3413627,
232.9637155, 236.4058197, 235.4124387, 238.2260309, 238.983854,
237.0509773, 234.22402, 245.5622443, 247.1072046, 248.7302949,
247.7311716, 253.6328878, 249.7883614, 245.8217667, 247.047291,
183.5981354, 186.4353531, 184.4024079, 183.9496282), Dx_50_percent = c(464.4559559,
470.4392019, 479.0066087, 474.75933, 478.1515348, 481.8823096,
480.3117339, 476.2827332, 442.6699831, 443.3890093, 442.4344575,
435.0531805, 443.4543899, 447.494161, 434.7639443, 433.3472111,
336.6085081, 340.9353695, 336.0106474, 340.0936298), Dx_90_percent = c(854.6392436,
856.8504381, 879.5524457, 880.468129, 858.3297603, 905.0097741,
879.5146896, 873.8584305, 819.4818726, 816.5296812, 778.9013718,
766.7617116, 770.5702479, 829.0866972, 766.083991, 751.9915196,
656.4105245, 664.7034131, 698.6157344, 718.225128), Microm_001 = c(0.797059348,
0.801571015, 0.734207569, 0.841152063, 0.834553976, 0.75429789,
0.831636299, 1.000633239, 0.713401217, 0.74354612, 0.741372753,
0.841747801, 0.727424775, 0.755804532, 1.163420288, 1.081749441,
0.27579194, 0.483475909, 0.555629788, 0.697398689), Microm_63 = c(0.90472056,
0.944959738, 0.94659555, 0.903114644, 0.96501726, 0.91079578,
0.954569594, 0.987258593, 0.000822487, 0.000571334, 0.000442701,
0.000297749, 0, 0.000259136, 0.000891928, 0.000769421, 0.923900573,
0.793744127, 0.809342888, 0.839719189), Microm_125 = c(11.30751247,
10.58007103, 10.18149105, 10.27507954, 9.833963719, 9.901098909,
9.983293752, 10.11735892, 10.0938523, 9.776308321, 9.483238809,
9.57495155, 8.647488515, 9.280930949, 9.601526818, 9.458248991,
26.14106339, 25.04179051, 25.88123946, 25.37747955), Microm_250 = c(42.63011079,
42.40400703, 41.42517791, 41.94288617, 41.87864039, 41.23829484,
41.31751454, 41.63495869, 49.13692679, 49.3879579, 50.23445206,
51.60666791, 51.1727376, 49.11877239, 51.26674279, 52.00409162,
50.48687475, 50.90777981, 50.23482884, 49.24764135), Microm_500 = c(39.88402392,
40.77838926, 41.50718101, 40.65764848, 42.15199068, 40.83119895,
41.73250567, 41.18885416, 35.87934642, 36.00915655, 37.3180857,
35.62608032, 37.41951083, 36.26828346, 35.6623117, 35.71836133,
20.18297131, 20.1836753, 17.33407897, 19.10041196), Microm_1000 = c(4.476572917,
4.49100193, 5.205346918, 5.380119103, 4.335833973, 6.364313631,
5.180480148, 5.070936398, 4.175650784, 4.082459775, 2.222407984,
2.350254673, 2.032838283, 4.575949528, 2.305106476, 1.736779196,
1.98939803, 2.589534343, 5.184880055, 4.737349268)), row.names = c(1L,
2L, 3L, 4L, 5L, 6L, 7L, 8L, 10L, 11L, 12L, 13L, 14L, 15L, 16L,
17L, 19L, 20L, 21L, 22L), class = "data.frame")
My pseudocode looks something like this:
#define function from SO answer
remove_outliers <- function(x, na.rm = TRUE, ...) {
qnt <- quantile(x, probs=c(.25, .75), na.rm = na.rm, ...)
H <- 1.5 * IQR(x, na.rm = na.rm)
y <- x
y[x < (qnt[1] - H)] <- NA
y[x > (qnt[2] + H)] <- NA
y
}
# using dplyr
out_test <- test %>%
group_by(Sample_name) %>%
apply(2, remove_outliers)
# using base R by
out_test2 <- by(test, test$Sample_name, remove_outliers)
How can I either mark/detect the rows which are significantly different from their parallels of the same sample or remove them directly?
Oh and a bonus question: are 8 samples sufficient to determine an outlier from a statistic's point of view? In my case it is about extreme cases resulted from failed measurements, but so no one else gets mislead.
Upvotes: 1
Views: 1070
Reputation: 13125
use mutate_at
after excluding the grouping variable.
library(dplyr)
test %>%
group_by(Sample_name) %>%
mutate_at(vars(-group_cols()), remove_outliers)
Upvotes: 1