stat.chat

Reputation: 53

How to detect univariate outliers and mark as TRUE or FALSE in new column

I have a dataframe with 30 columns and >10,000 rows.

How can I run an outlier analysis for a set of variables that returns TRUE if ANY of the variables exceeds its own threshold (3 SDs), and FALSE if none of them does, with the TRUE/FALSE values stored in a new column?

I have used quantile to find the 3 standard deviation cut-off values for each variable:

i.e.:

quantile(df$a, 0.003, na.rm = TRUE)  # lower cut-off

quantile(df$a, 0.997, na.rm = TRUE)  # upper cut-off

Say the lower value is 2.5 and the upper value is 10.5 for this variable. I then created a new variable:

df$outliers <- df$a < 2.5 | df$a > 10.5

which gives TRUE values when values in column a are less than 2.5 or greater than 10.5.

What I would like to do is have df$outliers represent the outlier status for a set of columns, not just one, e.g. columns d, e, f, g, l, m etc., each of which has its own threshold values to meet.

What is the best way to do this?

Upvotes: 5

Views: 961

Answers (2)

Ronak Shah

Reputation: 388862

Let's assume your data frame is called df and the columns you want to check for outliers are a, b and c (stored in cols). We can use sapply on those columns to find which values lie in the outlier range. This returns a matrix of TRUE/FALSE values indicating whether each value is an outlier. We then take rowSums of that matrix and assign TRUE if any column is TRUE in that row, and FALSE otherwise.

cols <- c("a", "b", "c")

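# na.rm = TRUE so that missing values do not make quantile() fail;
# rows whose only "hits" are NA are treated as non-outliers
df$outliers <- rowSums(sapply(df[cols], function(x) 
                       x < quantile(x, 0.003, na.rm = TRUE) | 
                       x > quantile(x, 0.997, na.rm = TRUE)), na.rm = TRUE) > 0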

df
#             a          b          c random outliers
#1  -0.56047565  1.2240818 -1.0678237      1    FALSE
#2  -0.23017749  0.3598138 -0.2179749      2    FALSE
#3   1.55870831  0.4007715 -1.0260044      3    FALSE
#4   0.07050839  0.1106827 -0.7288912      4    FALSE
#5   0.12928774 -0.5558411 -0.6250393      5    FALSE
#6   1.71506499  1.7869131 -1.6866933      6     TRUE
#7   0.46091621  0.4978505  0.8377870      7    FALSE
#8  -1.26506123 -1.9666172  0.1533731      8     TRUE
#9  -0.68685285  0.7013559 -1.1381369      9    FALSE
#10 -0.44566197 -0.4727914  1.2538149     10     TRUE

data

set.seed(123)
df <- data.frame(a = rnorm(10), b = rnorm(10), c = rnorm(10), random = 1:10)
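
The question mentions 3 SD thresholds, while the 0.003 and 0.997 quantiles are percentile cut-offs rather than exactly 3 standard deviations. If the intent is literally "more than 3 SDs from the column mean", a minimal sketch could look like the following. The helper name beyond_k_sd and the column name outliers_sd are hypothetical, chosen only for illustration.

```r
set.seed(123)
df <- data.frame(a = rnorm(10), b = rnorm(10), c = rnorm(10), random = 1:10)
cols <- c("a", "b", "c")

# Hypothetical helper: TRUE where x lies more than k sample SDs
# away from its own column mean (NAs are ignored in mean and sd).
beyond_k_sd <- function(x, k = 3) {
  abs(x - mean(x, na.rm = TRUE)) > k * sd(x, na.rm = TRUE)
}

# TRUE if at least one of the selected columns flags the row;
# NA comparisons are treated as "not an outlier" via na.rm = TRUE.
df$outliers_sd <- rowSums(sapply(df[cols], beyond_k_sd), na.rm = TRUE) > 0
```

On the 10-row example data this flags nothing, since with so few observations no value can sit 3 sample SDs from the mean; the rule only becomes informative on larger data such as the >10,000 rows in the question.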

Upvotes: 3

benaou mouad

Reputation: 400

In general, an observation is an outlier if it is an outlier for one or more features. But I don't know what you're dealing with, so it could be different: you have to find out how the problem you're working on defines an outlier, then choose the features that matter and their thresholds.

Going back to the first definition, you can create your column as the union (logical OR) of the per-variable results, applying to every variable the same process you used for a.

However, you should avoid doing this manually: create a table of all the variables' thresholds, then write a function that returns TRUE if the observation is an outlier for at least one variable.
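
A minimal sketch of that idea, assuming a hypothetical lookup table thresholds with one row per variable and a hypothetical helper flag_outliers(). The cut-offs 2.5 and 10.5 for a come from the question; the data and the values for d are made up for illustration.

```r
# Hypothetical example data; column names and values are illustrative only.
df <- data.frame(a = c(3, 11, 5, 2), d = c(0.5, 0.2, 0.9, 0.4))

# One row per variable, each with its own lower and upper cut-off.
thresholds <- data.frame(
  variable = c("a", "d"),
  lower    = c(2.5, 0.1),
  upper    = c(10.5, 0.8),
  stringsAsFactors = FALSE
)

# Returns TRUE for rows where at least one variable falls outside its range.
flag_outliers <- function(data, thresholds) {
  # One logical vector per variable: below lower or above upper cut-off.
  flags <- Map(
    function(v, lo, hi) data[[v]] < lo | data[[v]] > hi,
    thresholds$variable, thresholds$lower, thresholds$upper
  )
  # Bind the vectors into a matrix and OR across columns;
  # NA comparisons count as "not an outlier".
  rowSums(do.call(cbind, flags), na.rm = TRUE) > 0
}

df$outliers <- flag_outliers(df, thresholds)
```

Keeping the cut-offs in a table means adding another variable is a matter of adding a row, without touching the function.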

Upvotes: 0
