Reputation: 355
I have a fairly large vector of about 4 million rows. The problem is that an external source altered the sensor data, which produced a lot of outliers. I have detected more than 90% of them, but I cannot find an appropriate way to set the remaining 10% to NA. I don't want to delete them, just set them to NA.
This plot shows 100000 values. It does not look like this everywhere in the time series: sometimes there are no outliers left, sometimes it looks like this. That means I need an approach that finds those outliers without setting values to NA that are not outliers.
I tried different packages (tsoutliers, for example) without much success.
Is there a package or a method out there that can find all or at least most of the outliers seen in the plot?
Upvotes: 0
Views: 210
Reputation: 5958
In order to define outliers, you could first fit a model that defines what the "normal" values are with a certain level of confidence. This model can be a moving average, arima, ets, or many others.
library(fpp2)  # attaches the forecast package, which provides ets() and auto.arima()
dat <- c(1:50, 10, 52:100) + rnorm(100, sd = 5)  # linear trend with one planted outlier at position 51
fit <- ets(dat)  # works with any model, for example auto.arima(dat)
upper <- fitted(fit) + 1.96 * sqrt(fit$sigma2)  # 1.96 for a 95% confidence interval
lower <- fitted(fit) - 1.96 * sqrt(fit$sigma2)  # 1.96 for a 95% confidence interval
plot(dat, type = "n", ylim = range(lower, upper, dat))  # include dat so flagged points are not clipped
polygon(c(time(dat), rev(time(dat))), c(upper, rev(lower)),
        col = rgb(0, 0, 0.6, 0.2), border = FALSE)
lines(dat)
lines(fitted(fit), col = "red")
out <- (dat < lower | dat > upper)  # TRUE where a point falls outside the band
points(time(dat)[out], dat[out], pch = 19)
This will give you a chart where the outliers are identified and the confidence intervals are shown. You can then set the outliers to NA like so:
dat[out] <- NA # set outliers to NA
Please note that how many outliers you find will depend on the model you choose, for example with auto.arima.
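To make the dependence on the model concrete, here is a minimal sketch that applies the same band rule to both an ets fit and an auto.arima fit and counts the flagged points. The helper name `flag_outliers` and the fixed seed are assumptions added for illustration; the counts you get will depend on the data and the fitted models.

```r
library(forecast)  # provides ets() and auto.arima(); fpp2 attaches it as well

set.seed(1)  # assumption: fixed seed so the comparison is repeatable
dat <- c(1:50, 10, 52:100) + rnorm(100, sd = 5)

# Hypothetical helper: flag points outside fitted value +/- 1.96 residual sd
flag_outliers <- function(fit, x) {
  band <- 1.96 * sqrt(fit$sigma2)
  fit_vals <- as.vector(fitted(fit))
  x < fit_vals - band | x > fit_vals + band
}

out_ets <- flag_outliers(ets(dat), dat)
out_arima <- flag_outliers(auto.arima(dat), dat)

# Number of flagged points under each model
c(ets = sum(out_ets), arima = sum(out_arima))
```

Comparing `which(out_ets)` and `which(out_arima)` shows which points the two models disagree on.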
EDIT: this is based on Rob Hyndman's post here
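For a vector with millions of rows, fitting ets or auto.arima may be slow, so a cheaper "moving" model could be worth trying. The sketch below is not the method from the linked post: it swaps in a running median with a running MAD (a Hampel-style filter) built on base R's `runmed`. The helper name `flag_local_outliers`, the window width `k`, and the threshold `n_mad` are assumptions you would need to tune, and the input is assumed to contain no missing values (how `runmed` treats NA depends on your R version, so check its documentation).

```r
# Hypothetical helper: flag points far from a running median,
# measured in units of a running median absolute deviation (MAD).
flag_local_outliers <- function(x, k = 101, n_mad = 5) {
  med <- stats::runmed(x, k)              # running median, k must be odd
  dev <- abs(x - med)                     # distance from the local median
  mad <- 1.4826 * stats::runmed(dev, k)   # running MAD, scaled to match sd for normal data
  dev > n_mad * pmax(mad, .Machine$double.eps)  # guard against a zero MAD
}

set.seed(42)  # assumption: synthetic data, only for illustration
x <- sin(seq(0, 20, length.out = 5000)) + rnorm(5000, sd = 0.1)
spikes <- c(500, 1500, 3200)
x[spikes] <- x[spikes] + 3              # plant three obvious outliers

out <- flag_local_outliers(x)
x[out] <- NA                            # set flagged points to NA, keep the length
```

Because the threshold adapts to the local spread, stretches without outliers are mostly left alone, which matches the requirement that clean regions should not be set to NA.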
Upvotes: 1
Reputation: 126
It depends on how you define an outlier, as Sotos says. If you consider an outlier to be any value outside the range mean +- N*standard_dev, then it is easy to identify them numerically.
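The mean +- N*standard_dev rule can be sketched in a few lines of base R. The helper name `flag_global_outliers`, the default `n = 3`, and the sample data are assumptions for illustration. Note that the mean and standard deviation are themselves inflated by the outliers, so this rule works best when the outliers are few and extreme.

```r
# Hypothetical helper: flag values more than n standard deviations from the mean
flag_global_outliers <- function(x, n = 3) {
  m <- mean(x, na.rm = TRUE)
  s <- sd(x, na.rm = TRUE)
  !is.na(x) & abs(x - m) > n * s   # existing NAs are never flagged
}

set.seed(1)  # assumption: synthetic data, only for illustration
x <- c(rnorm(1000), 15, -12, NA)   # two planted outliers and one existing NA

out <- flag_global_outliers(x, n = 3)
x[out] <- NA                       # set outliers to NA instead of deleting them
```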
Upvotes: 1