Reputation: 181
I am struggling with some strange behaviour in R, with the quantile function.
I have two sets of numeric data, and a custom boxplot stats function (which someone helped me write, so I am actually not too sure about every detail):
sample_lang = c(91, 122, 65, 90, 90, 102,
98, 94, 84, 86, 108, 104,
94, 110, 100, 86, 92, 92,
124, 108, 82, 65, 102, 90, 114,
88, 68, 112, 96, 84, 92,
80, 104, 114, 112, 108, 68,
92, 68, 63, 112, 116)
sample_vocab = c(96, 136, 81, 92, 95,
112, 101, 95, 97, 94,
117, 95, 111, 115, 88,
92, 108, 81, 130, 106,
91, 95, 119, 103, 132, 103,
65, 114, 107, 108, 86,
100, 98, 111, 123, 123, 117,
82, 100, 97, 89, 132, 114)
my.boxplot.stats <- function (x, coef = 1.5, do.conf = TRUE, do.out = TRUE) {
  if (coef < 0)
    stop("'coef' must not be negative")
  nna <- !is.na(x)
  n <- sum(nna)
  #stats <- stats::fivenum(x, na.rm = TRUE)
  stats <- quantile(x, probs = c(0.15, 0.25, 0.5, 0.75, 0.85), na.rm = TRUE)
  iqr <- diff(stats[c(2, 4)])
  if (coef == 0)
    do.out <- FALSE
  else {
    out <- if (!is.na(iqr)) {
      x < (stats[2L] - coef * iqr) | x > (stats[4L] + coef * iqr)
    }
    else !is.finite(x)
    if (any(out[nna], na.rm = TRUE))
      stats[c(1, 5)] <- range(x[!out], na.rm = TRUE)
  }
  conf <- if (do.conf)
    stats[3L] + c(-1.58, 1.58) * iqr/sqrt(n)
  list(stats = stats, n = n, conf = conf,
       out = if (do.out) x[out & nna] else numeric())
}
However, when I call quantile and my.boxplot.stats on the same set of data, I get different quantile results for the sample_vocab data (although the results appear consistent for the sample_lang data), and I am not sure why:
> quantile(sample_vocab, probs = c(0.15, 0.25, 0.5, 0.75, 0.85), na.rm=TRUE)
15% 25% 50% 75% 85%
89.6 94.5 101.0 114.0 118.4
>
> my.boxplot.stats(sample_vocab)
$stats
15% 25% 50% 75% 85%
81.0 94.5 101.0 114.0 136.0
Could someone help me understand what is happening? Please note that I am reasonably experienced with programming but have no formal training in R; I am learning on my own.
Thanks so much in advance!
Upvotes: 0
Views: 290
Reputation: 44330
The relevant bit of code is right here:
if (coef == 0)
  do.out <- FALSE
else {
  out <- if (!is.na(iqr)) {
    x < (stats[2L] - coef * iqr) | x > (stats[4L] + coef * iqr)
  }
  else !is.finite(x)
  if (any(out[nna], na.rm = TRUE))
    stats[c(1, 5)] <- range(x[!out], na.rm = TRUE)
}
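Basically, if coef != 0 (in your case coef is 1.5, the default function parameter) and at least one data point lies more than coef * iqr below the 25% quantile or above the 75% quantile, then the first and last elements of the reported quantiles are replaced with the minimum and maximum data values that lie within coef * iqr of the 25% and 75% quantiles, where iqr is the distance between those quantiles. That is why only sample_vocab is affected: it contains such a point (65), while sample_lang does not.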
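To see this on your data, here is a minimal sketch that repeats just the outlier step outside of my.boxplot.stats. The helper name fence_check is hypothetical (it is not part of your function or of base R); it only uses quantile, diff and range, and it assumes the input has no NA values, which holds for both of your vectors.

```r
# Data copied from the question.
sample_lang <- c(91, 122, 65, 90, 90, 102, 98, 94, 84, 86, 108, 104,
                 94, 110, 100, 86, 92, 92, 124, 108, 82, 65, 102, 90, 114,
                 88, 68, 112, 96, 84, 92, 80, 104, 114, 112, 108, 68,
                 92, 68, 63, 112, 116)
sample_vocab <- c(96, 136, 81, 92, 95, 112, 101, 95, 97, 94,
                  117, 95, 111, 115, 88, 92, 108, 81, 130, 106,
                  91, 95, 119, 103, 132, 103, 65, 114, 107, 108, 86,
                  100, 98, 111, 123, 123, 117, 82, 100, 97, 89, 132, 114)

# Hypothetical helper: mirrors the else-branch of my.boxplot.stats.
# Assumes x contains no NA values.
fence_check <- function(x, coef = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), names = FALSE)
  iqr <- diff(q)                    # distance between the 25% and 75% quantiles
  lower <- q[1] - coef * iqr        # lower fence
  upper <- q[2] + coef * iqr        # upper fence
  out <- x < lower | x > upper      # TRUE for points outside the fences
  list(fences = c(lower, upper),
       outliers = x[out],
       kept_range = range(x[!out])) # what would overwrite stats[c(1, 5)]
}

vocab <- fence_check(sample_vocab)
lang  <- fence_check(sample_lang)

print(vocab$fences)      # 65.25 143.25
print(vocab$outliers)    # 65
print(vocab$kept_range)  # 81 136
print(lang$outliers)     # numeric(0)
```

For sample_vocab the value 65 sits just below the lower fence of 65.25, so the 15% and 85% entries get overwritten with 81 and 136, which is exactly what you see. sample_lang has no points outside its fences, so its quantiles are left alone. Reading the function as posted, calling it with coef = 0 skips this branch entirely, so that should leave the plain quantile values in $stats if that is what you want.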
Upvotes: 1