Iulia Brezoi

Reputation: 3

What function in R can I use to find a specific value?

Okay, so I have a CSV file that I read like this:

proiect<-read.csv("proiect.csv",header=T, sep=",")

It contains the date, the price for Dior, and the volume for Dior. I found the outliers of this series like this:

outlier_DiorVolum<-boxplot.stats(frameProiect$proiect.Volum_Dior)$out
outlier_DiorVolum

Now I have to find the days on which these outliers occur. What can I use?

Upvotes: 0

Views: 43

Answers (1)

r2evans

Reputation: 161155

The immediate attempt can use %in%, something like

boxplot.stats(iris$Sepal.Width)
# $stats
# [1] 2.2 2.8 3.0 3.3 4.0
# $n
# [1] 150
# $conf
# [1] 2.935497 3.064503
# $out
# [1] 4.4 4.1 4.2 2.0
iris[ iris$Sepal.Width %in% boxplot.stats(iris$Sepal.Width)$out, ]
#    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
# 16          5.7         4.4          1.5         0.4     setosa
# 33          5.2         4.1          1.5         0.1     setosa
# 34          5.5         4.2          1.4         0.2     setosa
# 61          5.0         2.0          3.5         1.0 versicolor

(or some dplyr or data.table variant).

Although I cannot find an example that proves the point here, testing floating-point values for equality is unreliable in R, as in other programming languages (see R FAQ 7.31 and IEEE-754). A robust version would need to match values within a tolerance. That is not hard to do, but it takes more code than the above.
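A tolerance-based match could be sketched as follows. The tolerance `tol` and the name `is_out` are assumptions made for illustration; pick a tolerance that suits the scale of your data.

```r
# Sketch: match values within a tolerance instead of using exact equality.
x   <- iris$Sepal.Width
out <- boxplot.stats(x)$out
tol <- 1e-8  # assumed tolerance, not a recommended default

# For each element of x, TRUE if it lies within `tol` of any outlier value.
is_out <- vapply(x, function(v) any(abs(v - out) < tol), logical(1))

which(is_out)     # row indices of the outliers
iris[is_out, ]    # the matching rows
```

On this data it should select the same four rows as the %in% version, since the values are recorded to one decimal place.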

But perhaps a better method would be to record the indices immediately within boxplot.stats and use them instead of over-engineering things. Yes, this is more code, but it also does less work and does it the same way as boxplot.stats.

boxplot.stats2 <- function (x, coef = 1.5, do.conf = TRUE, do.out = TRUE) {
  if (coef < 0) 
    stop("'coef' must not be negative")
  nna <- !is.na(x)
  n <- sum(nna)
  stats <- stats::fivenum(x, na.rm = TRUE)
  iqr <- diff(stats[c(2, 4)])
  if (coef == 0) {
    do.out <- FALSE
  } else {
    out <-
      if (!is.na(iqr)) {
        x < (stats[2L] - coef * iqr) | x > (stats[4L] + coef * iqr)
      } else !is.finite(x)
    if (any(out[nna], na.rm = TRUE)) {
      stats[c(1, 5)] <- range(x[!out], na.rm = TRUE)
    }
  }
  conf <- if (do.conf) stats[3L] + c(-1.58, 1.58) * iqr/sqrt(n)
  list(stats = stats, n = n, conf = conf,
       out = if (do.out) x[out & nna] else numeric(),
       out.ind = if (do.out) which(out) else integer())  ### NEW
}

(The only change to the original boxplot.stats function is the addition of out.ind to the return value. Everything it needs is already computed.)

bp <- boxplot.stats2(iris$Sepal.Width)
bp
# $stats
# [1] 2.2 2.8 3.0 3.3 4.0
# $n
# [1] 150
# $conf
# [1] 2.935497 3.064503
# $out
# [1] 4.4 4.1 4.2 2.0
# $out.ind
# [1] 16 33 34 61

and you can use $out.ind directly (iris[bp$out.ind,]).
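To tie this back to the question, here is a minimal sketch on made-up data. The data frame `df` and its columns `Date` and `Volum_Dior` are hypothetical stand-ins for the real file. Note that this is a different technique from the modified function above: it uses the unmodified boxplot.stats and takes the indices of the values lying outside the whisker ends (`stats[1]` and `stats[5]`), which are the extremes of the non-outlying data.

```r
# Hypothetical data standing in for the asker's CSV; column names are assumed.
df <- data.frame(
  Date       = as.Date("2021-01-01") + 0:9,
  Volum_Dior = c(100, 102, 98, 101, 500, 99, 103, 97, 100, 5)
)

bp <- boxplot.stats(df$Volum_Dior)

# Values strictly outside the whisker ends are the outliers; no equality test needed.
# which() drops any NA comparisons.
idx <- which(df$Volum_Dior < bp$stats[1] | df$Volum_Dior > bp$stats[5])

df$Date[idx]   # the days on which the outliers occur
df[idx, ]      # the full rows
```

The same indexing works with bp$out.ind from boxplot.stats2, for example df$Date[bp$out.ind].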

Upvotes: 2
