Reputation: 53
I have a dataframe with 30 columns and >10,000 rows.
How can I run an outlier analysis for a set of variables that will return a TRUE if ANY of the variables exceed the particular threshold (for that given variable), or FALSE if the respective outlier thresholds (3SDs) are not met for any of the variables, with the TRUE/FALSE values displaying in a new column?
I have used quantile to find the 3 standard deviation cut-off values for each variable:
i.e.:
quantile(df$a, 0.003, na.rm = T) #and
quantile(df$a, 0.997, na.rm = T)
Say the lower value is 2.5 and the upper value is 10.5 for this variable. I then created a new variable:
df$outliers <- df$a < 2.5 | df$a > 10.5
which gives TRUE values when values in column a are less than 2.5 or greater than 10.5.
What I would like is for df$outliers to represent the outlier status for a set of columns, not just one, e.g. columns d, e, f, g, l, m etc., each of which has its own threshold values to meet.
What is the best way to do this?
Upvotes: 5
Views: 961
Reputation: 388862
Let's assume your dataframe is called df and the columns you want to check for outliers are a, b and c (stored in cols). We can use sapply on those columns to find out which values lie in the outlier range. This returns a matrix of TRUE/FALSE values indicating whether each value is an outlier. We then take rowSums of that matrix and assign TRUE if at least one column is TRUE in that row, or FALSE otherwise.
cols <- c("a", "b", "c")
df$outliers <- rowSums(sapply(df[cols], function(x)
x < quantile(x, 0.003) | x > quantile(x, 0.997))) > 0
df
# a b c random outliers
#1 -0.56047565 1.2240818 -1.0678237 1 FALSE
#2 -0.23017749 0.3598138 -0.2179749 2 FALSE
#3 1.55870831 0.4007715 -1.0260044 3 FALSE
#4 0.07050839 0.1106827 -0.7288912 4 FALSE
#5 0.12928774 -0.5558411 -0.6250393 5 FALSE
#6 1.71506499 1.7869131 -1.6866933 6 TRUE
#7 0.46091621 0.4978505 0.8377870 7 FALSE
#8 -1.26506123 -1.9666172 0.1533731 8 TRUE
#9 -0.68685285 0.7013559 -1.1381369 9 FALSE
#10 -0.44566197 -0.4727914 1.2538149 10 TRUE
data
set.seed(123)
df <- data.frame(a = rnorm(10), b = rnorm(10), c = rnorm(10), random = 1:10)
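If the real columns contain missing values, as the na.rm = T in the question suggests, the call above would stop with an error, because quantile() does not accept NA by default. Here is a minimal sketch of an NA-tolerant variant. It assumes that a missing cell should count as "not an outlier"; that choice is an assumption, not something the question states, and the NA placed in row 3 is for illustration only.

```r
# Same example data as above, with one missing value added for illustration
set.seed(123)
df <- data.frame(a = rnorm(10), b = rnorm(10), c = rnorm(10), random = 1:10)
df$a[3] <- NA

cols <- c("a", "b", "c")

# Logical matrix: TRUE where a value falls outside that column's
# 0.3% / 99.7% quantiles (computed while ignoring NA)
is_out <- sapply(df[cols], function(x) {
  lims <- quantile(x, c(0.003, 0.997), na.rm = TRUE)
  x < lims[[1]] | x > lims[[2]]
})

# na.rm = TRUE treats an NA cell as "not an outlier" (assumption)
df$outliers <- rowSums(is_out, na.rm = TRUE) > 0
```

Without na.rm = TRUE in rowSums, any row containing an NA would get NA rather than TRUE/FALSE in the outliers column.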
Upvotes: 3
Reputation: 400
In general, an observation is an outlier if it is an outlier for one or more features. But I don't know what you're dealing with, so it could be different: you have to find out how the problem you're working on defines an outlier, and then you can choose the features that are important and their thresholds.
Going back to the first definition, you can create your column by applying the same process you used for one variable to all the variables and combining the results with a logical OR.
However, you should avoid doing this manually: create a table of all the variables' thresholds, then write a function that returns TRUE if the observation is an outlier for at least one variable.
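The threshold-table idea could be sketched roughly as follows. The example data, the column names d, e and f, the thresholds table layout and the helper flag_outliers are all hypothetical and only illustrate the approach. The 0.003/0.997 quantile cut-offs are taken from the question.

```r
# Hypothetical example data; in practice df is your own data frame
set.seed(123)
df <- data.frame(d = rnorm(100), e = rnorm(100, 5), f = rnorm(100, 10, 2))

cols <- c("d", "e", "f")

# Threshold table: one row per variable with its lower and upper cut-off
thresholds <- data.frame(
  variable = cols,
  lower = sapply(df[cols], quantile, probs = 0.003, na.rm = TRUE),
  upper = sapply(df[cols], quantile, probs = 0.997, na.rm = TRUE),
  row.names = NULL,
  stringsAsFactors = FALSE
)

# Returns TRUE for every row that lies outside the limits of
# at least one variable listed in the threshold table
flag_outliers <- function(data, thresholds) {
  out <- mapply(
    function(v, lo, hi) data[[v]] < lo | data[[v]] > hi,
    thresholds$variable, thresholds$lower, thresholds$upper
  )
  rowSums(out, na.rm = TRUE) > 0
}

df$outliers <- flag_outliers(df, thresholds)
```

Keeping the cut-offs in a separate table means they can be inspected or edited by hand (for example, replaced with mean ± 3 SD limits) without touching the function.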
Upvotes: 0