How to detect univariate outliers and mark as TRUE or FALSE in new column

Question

I have a dataframe with 30 columns and >10,000 rows.

How can I run an outlier analysis on a set of variables that puts TRUE in a new column if ANY of the variables exceeds its own outlier threshold (3 SDs), and FALSE if none of the variables exceeds its threshold?

I have used quantile to find the 3 standard deviation cut-off values for each variable:

e.g.:

quantile(df$a, 0.003, na.rm = TRUE)  # lower cut-off

quantile(df$a, 0.997, na.rm = TRUE)  # upper cut-off

Say the lower value is 2.5 and the upper value is 10.5 for this variable. I then created a new variable:

df$outliers <- df$a < 2.5 | df$a > 10.5

which gives TRUE values when values in column a are less than 2.5 or greater than 10.5.

What I would like is for df$outliers to represent the outlier status for a set of columns, not just one, e.g. columns d, e, f, g, l, m etc., each of which has its own threshold values.

What is the best way to do this?

Ronak Shah · Accepted Answer

Let's assume your dataframe is called df and the columns you want to check for outliers are a, b and c (stored in cols). We can use sapply on those columns to find out which values lie in the outlier range. This returns a matrix of TRUE/FALSE values indicating whether each value is an outlier. We then take rowSums of that matrix and assign TRUE if at least one column is TRUE in that row, and FALSE otherwise.

cols <- c("a", "b", "c")

df$outliers <- rowSums(sapply(df[cols], function(x) 
                       x < quantile(x, 0.003) | x > quantile(x, 0.997))) > 0

df
#             a          b          c random outliers
#1  -0.56047565  1.2240818 -1.0678237      1    FALSE
#2  -0.23017749  0.3598138 -0.2179749      2    FALSE
#3   1.55870831  0.4007715 -1.0260044      3    FALSE
#4   0.07050839  0.1106827 -0.7288912      4    FALSE
#5   0.12928774 -0.5558411 -0.6250393      5    FALSE
#6   1.71506499  1.7869131 -1.6866933      6     TRUE
#7   0.46091621  0.4978505  0.8377870      7    FALSE
#8  -1.26506123 -1.9666172  0.1533731      8     TRUE
#9  -0.68685285  0.7013559 -1.1381369      9    FALSE
#10 -0.44566197 -0.4727914  1.2538149     10     TRUE
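
The question asks for a 3 SD threshold, while the 0.003 and 0.997 quantiles are only a rough stand-in for it (for normal data, mean ± 3 SD leaves about 0.135% in each tail). If cut-offs at the mean ± 3 standard deviations are what is actually wanted, the same sapply/rowSums pattern might be adapted as in the sketch below. The larger sample size and the planted value in b are assumptions made purely for illustration: with only 10 rows, no value can lie more than 3 SDs from its column mean.

```r
# Hypothetical data: 1000 rows, with one extreme value planted in column b
set.seed(123)
df <- data.frame(a = rnorm(1000), b = rnorm(1000), c = rnorm(1000))
df$b[4] <- 25

cols <- c("a", "b", "c")

# For each column, flag values more than 3 SDs away from that column's mean
is_out <- sapply(df[cols], function(x) {
  abs(x - mean(x, na.rm = TRUE)) > 3 * sd(x, na.rm = TRUE)
})

# TRUE if any of the chosen columns is flagged in that row
df$outliers <- rowSums(is_out, na.rm = TRUE) > 0
```

Note that a single extreme value inflates the SD itself, so this rule can be less sensitive than it looks on small or heavily contaminated samples.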

data

set.seed(123)
df <- data.frame(a = rnorm(10), b = rnorm(10), c = rnorm(10), random = 1:10)
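
The sample data has no missing values, whereas the quantile calls in the question pass na.rm. quantile() stops with an error when it meets an NA unless na.rm = TRUE is set, so on real data a variant along the following lines may be needed. This is a minimal sketch; the NA placed in column a is an assumption added only to exercise that case.

```r
# Same sample data as above, with one hypothetical missing value
set.seed(123)
df <- data.frame(a = rnorm(10), b = rnorm(10), c = rnorm(10), random = 1:10)
df$a[2] <- NA

cols <- c("a", "b", "c")

# Flag values outside the 0.3% / 99.7% quantiles of each column, ignoring NAs
is_out <- sapply(df[cols], function(x) {
  lims <- quantile(x, c(0.003, 0.997), na.rm = TRUE)
  x < lims[[1]] | x > lims[[2]]
})

# Missing cells are NA in is_out; na.rm = TRUE counts them as "not flagged"
df$outliers <- rowSums(is_out, na.rm = TRUE) > 0
```

Treating a missing value as "not an outlier" is a design choice; if rows with missing values should be marked differently, the NA entries in is_out can be inspected before collapsing to one column.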
