Terrence J

Reputation: 181

Differences in quantile function

I am struggling with some strange behaviour of the quantile function in R.

I have two sets of numeric data, and a custom boxplot stats function (which someone helped me write, so I am actually not too sure about every detail):

sample_lang = c(91, 122,  65,  90,  90, 102,
            98,  94,  84,  86, 108, 104,
            94, 110, 100,  86,  92,  92,
            124, 108,  82,  65, 102,  90, 114,
            88,  68, 112,  96,  84,  92,
            80, 104, 114, 112, 108,  68,
            92,  68,  63, 112, 116)

sample_vocab = c(96, 136,  81,  92,  95,
                 112, 101,  95,  97,  94,
                 117,  95, 111, 115,  88,
                 92, 108,  81, 130, 106,  
                 91,  95, 119, 103, 132, 103,
                 65, 114, 107, 108,  86, 
                 100,  98, 111, 123, 123, 117,
                 82, 100,  97,  89, 132, 114)

my.boxplot.stats <- function (x, coef = 1.5, do.conf = TRUE, do.out = TRUE) {
  if (coef < 0) 
    stop("'coef' must not be negative")
  nna <- !is.na(x)
  n <- sum(nna)
  #stats <- stats::fivenum(x, na.rm = TRUE)
  stats <- quantile(x, probs = c(0.15, 0.25, 0.5, 0.75, 0.85), na.rm = TRUE)
  iqr <- diff(stats[c(2, 4)])
  if (coef == 0) 
    do.out <- FALSE
  else {
    out <- if (!is.na(iqr)) {
      x < (stats[2L] - coef * iqr) | x > (stats[4L] + coef * iqr)
    }
    else !is.finite(x)
    if (any(out[nna], na.rm = TRUE)) 
      stats[c(1, 5)] <- range(x[!out], na.rm = TRUE)
  }
  conf <- if (do.conf) 
    stats[3L] + c(-1.58, 1.58) * iqr/sqrt(n)
  list(stats = stats, n = n, conf = conf,
       out = if (do.out) x[out & nna] else numeric())
}

However, when I call quantile and my.boxplot.stats on the same data, I get different values for the 15% and 85% quantiles of sample_vocab (the results for sample_lang are consistent), and I am not sure why:

> quantile(sample_vocab, probs = c(0.15, 0.25, 0.5, 0.75, 0.85), na.rm=TRUE)
  15%   25%   50%   75%   85% 
 89.6  94.5 101.0 114.0 118.4 
> 
> my.boxplot.stats(sample_vocab)
$stats
  15%   25%   50%   75%   85% 
 81.0  94.5 101.0 114.0 136.0 

Could someone help me understand what is happening? Please note that I am reasonably experienced with programming but have no formal training in R; I am learning on my own.

Thanks so much in advance!

Upvotes: 0

Views: 290

Answers (1)

josliber

Reputation: 44330

The relevant bit of code is right here:

  if (coef == 0) 
    do.out <- FALSE
  else {
    out <- if (!is.na(iqr)) {
      x < (stats[2L] - coef * iqr) | x > (stats[4L] + coef * iqr)
    }
    else !is.finite(x)
    if (any(out[nna], na.rm = TRUE)) 
      stats[c(1, 5)] <- range(x[!out], na.rm = TRUE)
  }

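Basically, if coef != 0 (in your case coef is 1.5, the default argument) and at least one data value lies more than coef * iqr below the 25% quantile or above the 75% quantile, then the first and last elements of the reported quantiles are replaced with the minimum and maximum of the remaining data values. Here iqr is the distance between the 25% and 75% quantiles.

That is why only sample_vocab is affected. Its 25% and 75% quantiles are 94.5 and 114, so iqr is 19.5 and the lower cutoff is 94.5 - 1.5 * 19.5 = 65.25. The value 65 falls below that, so the 15% and 85% entries are overwritten with 81 and 136, the range of the remaining values. sample_lang has no value outside its cutoffs, so its quantiles are left alone.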
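
If you want to check this on your own data, here is a minimal sketch that repeats just the outlier step outside of my.boxplot.stats. The helper name flag_outliers is my own invention for illustration (it is not part of R or of your function), and it assumes the default quantile type, which is what your function uses.

```r
# Data copied from the question.
sample_lang <- c(91, 122, 65, 90, 90, 102, 98, 94, 84, 86, 108, 104,
                 94, 110, 100, 86, 92, 92, 124, 108, 82, 65, 102, 90, 114,
                 88, 68, 112, 96, 84, 92, 80, 104, 114, 112, 108, 68,
                 92, 68, 63, 112, 116)

sample_vocab <- c(96, 136, 81, 92, 95, 112, 101, 95, 97, 94,
                  117, 95, 111, 115, 88, 92, 108, 81, 130, 106,
                  91, 95, 119, 103, 132, 103, 65, 114, 107, 108, 86,
                  100, 98, 111, 123, 123, 117, 82, 100, 97, 89, 132, 114)

# Hypothetical helper: reproduces only the fence/outlier logic
# from my.boxplot.stats so the intermediate values can be inspected.
flag_outliers <- function(x, coef = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  lower <- q[1] - coef * iqr   # values below this are flagged
  upper <- q[2] + coef * iqr   # values above this are flagged
  out <- x < lower | x > upper
  list(fences = c(lower, upper),
       outliers = x[out],
       kept_range = range(x[!out]))  # what overwrites stats[c(1, 5)]
}

res_vocab <- flag_outliers(sample_vocab)
res_lang  <- flag_outliers(sample_lang)

res_vocab$fences      # 65.25 143.25
res_vocab$outliers    # 65
res_vocab$kept_range  # 81 136
res_lang$outliers     # numeric(0), so nothing is overwritten
```

If you would rather always get the plain 15% and 85% quantiles back, one option (again, just a suggestion) is to call my.boxplot.stats(sample_vocab, coef = 0), which skips this branch entirely.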

Upvotes: 1
