Reputation: 23
As requested by a collaborator, I am trying to create a second version of a dataset with outliers removed. I have data with multiple groups (factors) and multiple numeric response variables. I want to write a function that (1) finds outliers and extremes by group using the 1.5*IQR and 3*IQR methods, (2) counts the outliers, and (3) if the number of outliers is greater than 2, replaces values for EXTREMES only with NA, but if the number of outliers is less than or equal to 2, replaces values for OUTLIERS with NA.
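To make the rule in step (1) concrete, here is a minimal base R sketch of the fences for a single vector. The helper name iqr_flags and the toy vector are made up for illustration, and it assumes the default quantile() method for the quartiles.

```r
# Hypothetical helper: TRUE for values outside the k * IQR fences of x.
iqr_flags <- function(x, k = 1.5) {
  q   <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  x < q[1] - k * iqr | x > q[2] + k * iqr
}

# Toy data: quartiles are 12 and 16, so the IQR is 4.
x <- c(10, 11, 12, 13, 14, 15, 16, 25, 60)
x[iqr_flags(x)]         # outliers by the 1.5 * IQR rule: 25 and 60
x[iqr_flags(x, k = 3)]  # extremes by the 3 * IQR rule: 60
```

Step (2) is then just sum(iqr_flags(x)), computed within each group.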
Because this is a grouped calculation, I have opted to use dplyr. I am hoping to apply this function to each of the several response variables in my dataset. I have achieved steps (1) and (2) with this method, but am facing the following issues with step (3).
Steps 1 and 2:
require(dplyr)
# Find outliers and extremes for one response variable by group. Mark if number of outliers per group is >2.
# List outliers and extremes.
out_ext_num <- iris %>%
  group_by(Species) %>%
  mutate(is_outlier = (Sepal.Length < summary(Sepal.Length)[2] - (1.5*IQR(Sepal.Length)) |
                         (Sepal.Length > ((1.5*IQR(Sepal.Length)) + summary(Sepal.Length)[5]))),
         is_extreme = (Sepal.Length < (summary(Sepal.Length)[2] - (3*IQR(Sepal.Length)))) |
           (Sepal.Length > ((3*IQR(Sepal.Length)) + summary(Sepal.Length)[5]))) %>%
  summarise(out_num2 = sum(is_outlier) > 2, outliers = list(Sepal.Length[is_outlier == T]),
            extremes = list(Sepal.Length[is_extreme == T]))
# A tibble: 3 x 4
  Species    out_num2 outliers  extremes
  <fct>      <lgl>    <list>    <list>
1 setosa     FALSE    <dbl [0]> <dbl [0]>
2 versicolor FALSE    <dbl [0]> <dbl [0]>
3 virginica  FALSE    <dbl [1]> <dbl [0]>
I can combine this with a nested version of my data so that the groups line up: (EDITED)
require(tidyr)  # nest() comes from tidyr, not dplyr
nested <- iris %>%
  select(Species, Sepal.Length) %>%
  group_by(Species) %>%
  nest() %>%
  left_join(out_ext_num, by = "Species")
# A tibble: 3 x 5
# Groups: Species [3]
  Species    data              out_num2 outliers  extremes
  <fct>      <list>            <lgl>    <list>    <list>
1 setosa     <tibble [50 x 1]> FALSE    <dbl [0]> <dbl [0]>
2 versicolor <tibble [50 x 1]> FALSE    <dbl [0]> <dbl [0]>
3 virginica  <tibble [50 x 1]> FALSE    <dbl [1]> <dbl [0]>
Now, for each group, if out_num2 is FALSE, I want to match values from the data tibble (containing vector Sepal.Length) with values from the outliers list. If they match, I want to replace that value in the tibble with NA. If out_num2 is TRUE, I want to match values from the data tibble with values from the extremes list and replace corresponding values in the tibble with NA. Right now, I can't even match values from the list with the tibble, period, let alone do it conditionally. I have tried using %in% within mutate() and map(), but sense that I am incorrectly referencing the Sepal.Length vector within the tibble:
require(purrr)
nested %>%
mutate(Sepal.Length.o = map(data, ~ ifelse(Sepal.Length[.x %in% nested$outliers], NA, Sepal.Length)))
I've reviewed list referencing and several posts on matching values from different sites, but nothing quite works with this dplyr format. Is there another or better way to do this?
After figuring out how to match, I was planning to use nested ifelse statements to conditionally replace outlier or extreme values with NA, perhaps similar to this:
nested %>%
mutate(Sepal.Length.o = ifelse(out_num2 == T,
ifelse(match_tibble_with_extremes, NA, Sepal.Length),
ifelse(match_tibble_with_outliers, NA, Sepal.Length)))
But I'm not sure if that will work as I intend. Any help, especially with the matching step, would be greatly appreciated.
Upvotes: 2
Views: 215
Reputation: 887173
Here is one option
library(dplyr)
library(purrr)
nested %>%
  rowwise() %>%
  mutate(data = map2(data, if (out_num2) list(extremes) else list(outliers),
                     ~ replace(.x, .x %in% .y, NA)))
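If the nested structure is not needed for anything else, another possible route is to skip the nesting and the value matching altogether and apply the rule inside a grouped mutate(). The sketch below is only one way to do it. The helper na_outliers and its max_out argument are hypothetical names, it assumes dplyr >= 1.0.0 for across(), and it assumes the "more than 2 outliers" count should be taken separately for each response variable within each group.

```r
library(dplyr)

# Hypothetical helper (name and arguments are illustrative).
# If a vector has more than `max_out` outliers (1.5 * IQR rule), only the
# extremes (3 * IQR rule) are set to NA; otherwise all outliers are set to NA.
na_outliers <- function(x, max_out = 2) {
  q   <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  is_outlier <- x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr
  is_extreme <- x < q[1] - 3 * iqr   | x > q[2] + 3 * iqr
  to_na <- if (sum(is_outlier, na.rm = TRUE) > max_out) is_extreme else is_outlier
  replace(x, which(to_na), NA)
}

# Apply the helper to every numeric column, one Species at a time.
iris_clean <- iris %>%
  group_by(Species) %>%
  mutate(across(where(is.numeric), na_outliers)) %>%
  ungroup()
```

Because the flags are computed by position on the vector itself, there is no need to store the outlier values in a list column and match them back with %in%.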
Upvotes: 1