BallerNacken

Reputation: 355

find outliers and set them NA

I have a fairly large vector of about 4 million values. An external source altered the sensor data, which produced a lot of outliers. I have detected more than 90% of them, but I cannot find an appropriate way to set the remaining 10% to NA. I don't want to delete them, just set them to NA.

[Image: plot of the sensor data]

This plot shows 100000 values. The time series does not look like this everywhere: in some stretches there are no outliers left, in others it looks like this. So I need an approach that finds those outliers without setting data to NA that are not outliers.

I tried different packages (tsoutliers, for example) without much success.

Is there a package or a method out there that can find all or at least most of the outliers seen in the plot?

Upvotes: 0

Views: 210

Answers (2)

gaut

Reputation: 5958

To define outliers, you can first fit a model that describes the "normal" values within a given confidence level. The model can be a moving average, ARIMA, ETS (used here), or many others.

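library(fpp2)  # attaches the forecast package, which provides ets() and auto.arima()
# Linear trend plus noise, with one injected outlier (10 instead of 51) at position 51
dat <- c(1:50, 10, 52:100) + rnorm(100, sd = 5)
fit <- ets(dat)  # any model with fitted values works, for example auto.arima(dat)
# fit$sigma2 is the model's residual variance; 1.96 gives an approximate 95% band
upper <- fitted(fit) + 1.96 * sqrt(fit$sigma2)
lower <- fitted(fit) - 1.96 * sqrt(fit$sigma2)
# Include dat in ylim so that points outside the band are not clipped
plot(dat, type = "n", ylim = range(dat, lower, upper))
polygon(c(time(dat), rev(time(dat))), c(upper, rev(lower)),
        col = rgb(0, 0, 0.6, 0.2), border = FALSE)  # shaded band
lines(dat)
lines(fitted(fit), col = "red")  # fitted values
out <- (dat < lower | dat > upper)  # TRUE where a point falls outside the band
points(time(dat)[out], dat[out], pch = 19)  # mark the outliers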

This gives you a chart with the outliers marked and the confidence band shaded. [Image: outlier identification with exponential model] You can then set the outliers to NA like so:

dat[out] <- NA  # set outliers to NA; the vector keeps its length

Please note that the number of outliers you find depends on the model you choose. For example, with auto.arima: [Image: auto.arima]
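To illustrate how the choice of model changes what gets flagged, here is a minimal sketch that uses a centred moving average (one of the models mentioned above) in place of ets, with base R only. The seed, the window width `k`, and the names `ma`, `out_ma` and `dat_clean` are assumptions made for this example and are not part of the original answer.

```r
# Sketch: same thresholding idea, with a centred moving average as the "model".
set.seed(1)                                  # assumed seed, for reproducibility only
dat <- c(1:50, 10, 52:100) + rnorm(100, sd = 5)

k  <- 5                                      # assumed window width (odd)
ma <- as.numeric(stats::filter(dat, rep(1 / k, k), sides = 2))  # NA at both ends

res <- dat - ma                              # residuals from the moving average
s   <- sd(res, na.rm = TRUE)                 # residual standard deviation

upper_ma <- ma + 1.96 * s
lower_ma <- ma - 1.96 * s

# Points without a moving-average value (the edges) are never flagged
out_ma <- !is.na(ma) & (dat < lower_ma | dat > upper_ma)

dat_clean <- dat
dat_clean[out_ma] <- NA                      # set flagged points to NA, keep the length
sum(out_ma)                                  # number of flagged points
```

Note that the window includes the point being tested, so a large outlier pulls the moving average towards itself. A running median (`stats::runmed`) is less affected by this and may suit spiky sensor data better, but that is a judgment call for your data.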

EDIT: this is based on Rob Hyndman's post here

Upvotes: 1

T. Ciffréo

Reputation: 126

As Sotos says, it depends on how you define an outlier. If you consider an outlier to be any value outside the range mean ± N * standard_dev, then it is easy to identify them numerically.
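A minimal sketch of that rule in base R, assuming a global mean and standard deviation over the whole vector. The helper name `flag_outliers` and its `n_sd` argument are hypothetical, chosen only for this example.

```r
# Hypothetical helper: set values outside mean +/- n_sd * sd to NA.
flag_outliers <- function(x, n_sd = 3) {
  m <- mean(x, na.rm = TRUE)                 # global mean, ignoring existing NAs
  s <- sd(x, na.rm = TRUE)                   # global standard deviation
  out <- !is.na(x) & abs(x - m) > n_sd * s   # TRUE for values outside the range
  x[out] <- NA                               # set to NA, do not delete
  x
}

x <- c(10, 11, 9, 10, 12, 10, 11, 9, 10, 100)
flag_outliers(x, n_sd = 2)                   # the last value (100) becomes NA
```

The outliers themselves inflate the mean and standard deviation: in this small example `n_sd = 3` would not flag the value 100 at all. For a long series whose level changes over time, it may work better to apply the same rule per window or to the residuals of a fitted model.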

Upvotes: 1
