Reputation: 3
Okay, so I have a CSV file that I read like this:
proiect<-read.csv("proiect.csv",header=T, sep=",")
It contains the date, the price for Dior, and the volume for Dior. I found the outliers of this series like this:
outlier_DiorVolum<-boxplot.stats(frameProiect$proiect.Volum_Dior)$out
outlier_DiorVolum
Now I have to find the days on which these outliers occur. What can I use?
Upvotes: 0
Views: 43
Reputation: 161155
The immediate attempt can use %in%, something like:
boxplot.stats(iris$Sepal.Width)
# $stats
# [1] 2.2 2.8 3.0 3.3 4.0
# $n
# [1] 150
# $conf
# [1] 2.935497 3.064503
# $out
# [1] 4.4 4.1 4.2 2.0
iris[ iris$Sepal.Width %in% boxplot.stats(iris$Sepal.Width)$out, ]
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 16 5.7 4.4 1.5 0.4 setosa
# 33 5.2 4.1 1.5 0.1 setosa
# 34 5.5 4.2 1.4 0.2 setosa
# 61 5.0 2.0 3.5 1.0 versicolor
(or some dplyr or data.table variant).
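Applied to data shaped like the question's, the same idea might look like the sketch below. The data frame, its column names (Data, Volum_Dior) and its values are made up for illustration; they are not taken from the asker's file.

```r
# Toy data standing in for the asker's CSV; names and values are hypothetical.
df <- data.frame(
  Data       = as.Date("2020-01-01") + 0:9,
  Volum_Dior = c(100, 105, 98, 102, 500, 101, 99, 103, 97, 5)
)

# Values that boxplot.stats() flags as outliers.
out_vals <- boxplot.stats(df$Volum_Dior)$out

# Dates of the rows whose volume is one of those flagged values.
out_days <- df$Data[df$Volum_Dior %in% out_vals]
out_days
```

Subsetting the whole data frame instead (df[df$Volum_Dior %in% out_vals, ]) would keep the volume next to each date.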
While I cannot find an example that demonstrates the problem here, in R and other programming languages testing floating-point numbers for equality is hard to do consistently (this is related to R FAQ 7.31 and IEEE-754). To be safe, one would need to match the values within a tolerance. That is not hard to do, but it is certainly more code than the above.
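A minimal sketch of such tolerance-based matching, assuming a tolerance of 1e-8 is acceptable for the data at hand (the value and the variable names are my own choices, not something boxplot.stats prescribes):

```r
x <- iris$Sepal.Width
out_vals <- boxplot.stats(x)$out

tol <- 1e-8  # assumed tolerance; adjust to the scale of your data

# For each element of x, check whether it lies within `tol` of any outlier value.
is_out <- vapply(x, function(v) any(abs(v - out_vals) < tol), logical(1))

# Row indices of the matching observations.
out_idx <- which(is_out)
out_idx
```

The indices can then be used for subsetting, as in iris[out_idx, ].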
But perhaps a better method would be to record the indices immediately within boxplot.stats and use them instead of over-engineering things. Yes, this is more code, but it also does less work and does it the same way as boxplot.stats.
boxplot.stats2 <- function (x, coef = 1.5, do.conf = TRUE, do.out = TRUE) {
  if (coef < 0)
    stop("'coef' must not be negative")
  nna <- !is.na(x)
  n <- sum(nna)
  stats <- stats::fivenum(x, na.rm = TRUE)
  iqr <- diff(stats[c(2, 4)])
  if (coef == 0) {
    do.out <- FALSE
  } else {
    out <-
      if (!is.na(iqr)) {
        x < (stats[2L] - coef * iqr) | x > (stats[4L] + coef * iqr)
      } else !is.finite(x)
    if (any(out[nna], na.rm = TRUE)) {
      stats[c(1, 5)] <- range(x[!out], na.rm = TRUE)
    }
  }
  conf <- if (do.conf) stats[3L] + c(-1.58, 1.58) * iqr/sqrt(n)
  list(stats = stats, n = n, conf = conf,
       out = if (do.out) x[out & nna] else numeric(),
       out.ind = if (do.out) which(out) else integer()) ### NEW
}
(The only change to the original boxplot.stats function is the addition of out.ind in the return value. Everything we need is already included.)
bp <- boxplot.stats2(iris$Sepal.Width)
bp
# $stats
# [1] 2.2 2.8 3.0 3.3 4.0
# $n
# [1] 150
# $conf
# [1] 2.935497 3.064503
# $out
# [1] 4.4 4.1 4.2 2.0
# $out.ind
# [1] 16 33 34 61
and you can use $out.ind directly (iris[bp$out.ind,]).
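If copying the function feels heavy, a possible alternative (my own sketch, not part of the approach above) avoids equality tests with plain inequalities. It relies on what the function body shows: when outliers exist, the first and fifth elements of $stats are reset to the range of the non-outlying values, so anything strictly outside that range is an outlier. This assumes finite numeric data and a positive coef.

```r
x <- iris$Sepal.Width
bp <- boxplot.stats(x)

# Whisker ends: the smallest and largest values that are not outliers.
lower <- bp$stats[1]
upper <- bp$stats[5]

# Anything strictly beyond the whiskers is an outlier; which() drops NAs.
out_idx <- which(x < lower | x > upper)
out_idx
```

As before, iris[out_idx, ] returns the corresponding rows.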
Upvotes: 2