user3874160

Reputation: 23

R Dplyr: How to replace items in a vector matching items in a list, conditionally

As requested by a collaborator, I am trying to create a second version of a dataset with outliers removed. I have data with multiple groups (factors) and multiple numeric response variables. I want to write a function that (1) finds outliers and extremes by group using the 1.5*IQR and 3*IQR rules, (2) counts the outliers per group, and (3) if a group has more than 2 outliers, replaces only the EXTREMES with NA, but if it has 2 or fewer outliers, replaces the OUTLIERS with NA.

Because this is a grouped calculation, I have opted to use dplyr. I am hoping to apply this function to each of the several response variables in my dataset. I have achieved steps (1) and (2) with this method, but am facing the following issues with step (3).

Steps 1 and 2:

library(dplyr)

# Find outliers and extremes for one response variable by group. Mark if number of outliers per group is >2. 
# List outliers and extremes.
out_ext_num <- iris %>%
  group_by(Species) %>%
  mutate(is_outlier = (Sepal.Length < summary(Sepal.Length)[2] - (1.5*IQR(Sepal.Length)) | 
                         (Sepal.Length > ((1.5*IQR(Sepal.Length)) + summary(Sepal.Length)[5]))),
         is_extreme = (Sepal.Length < (summary(Sepal.Length)[2] - (3*IQR(Sepal.Length)))) | 
           (Sepal.Length > ((3*IQR(Sepal.Length)) + summary(Sepal.Length)[5]))) %>%
  summarise(out_num2 = sum(is_outlier) > 2, outliers = list(Sepal.Length[is_outlier]), 
            extremes = list(Sepal.Length[is_extreme]))

# A tibble: 3 x 4
  Species    out_num2 outliers  extremes 
  <fct>      <lgl>    <list>    <list>   
1 setosa     FALSE    <dbl [0]> <dbl [0]>
2 versicolor FALSE    <dbl [0]> <dbl [0]>
3 virginica  FALSE    <dbl [1]> <dbl [0]>
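
As a cross-check on the tibble above, here is a minimal sketch that computes the same fences for a single group with quantile() in base R. The variable names are illustrative only, and it assumes the default quantile type, which is also what summary() and IQR() use.

```r
# Cross-check for one group (virginica) using base R only.
x <- iris$Sepal.Length[iris$Species == "virginica"]

# First and third quartiles, the same values summary(x)[2] and summary(x)[5] hold.
q <- quantile(x, c(0.25, 0.75), names = FALSE)
iqr <- q[2] - q[1]

# Lower and upper fences for the 1.5*IQR and 3*IQR rules.
outlier_fences <- c(q[1] - 1.5 * iqr, q[2] + 1.5 * iqr)
extreme_fences <- c(q[1] - 3 * iqr, q[2] + 3 * iqr)

# Values falling outside each pair of fences.
virginica_outliers <- x[x < outlier_fences[1] | x > outlier_fences[2]]
virginica_extremes <- x[x < extreme_fences[1] | x > extreme_fences[2]]
```

The lengths of virginica_outliers and virginica_extremes should agree with the <dbl [1]> and <dbl [0]> entries in the virginica row.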

I can combine this with a nested version of my data so that the groups line up:

library(tidyr)  # provides nest()

nested <- iris %>%
  select(Species, Sepal.Length) %>%
  group_by(Species) %>%
  nest() %>%
  left_join(out_ext_num, by = "Species")

# A tibble: 3 x 5
# Groups:   Species [3]
  Species    data              out_num2 outliers  extremes 
  <fct>      <list>            <lgl>    <list>    <list>   
1 setosa     <tibble [50 x 1]> FALSE    <dbl [0]> <dbl [0]>
2 versicolor <tibble [50 x 1]> FALSE    <dbl [0]> <dbl [0]>
3 virginica  <tibble [50 x 1]> FALSE    <dbl [1]> <dbl [0]>

Now, for each group, if out_num2 is FALSE, I want to match values from the data tibble (containing vector Sepal.Length) with values from the outliers list. If they match, I want to replace that value in the tibble with NA. If out_num2 is TRUE, I want to match values from the data tibble with values from the extremes list and replace corresponding values in the tibble with NA. Right now, I can't even match values from the list with the tibble, period, let alone do it conditionally. I have tried using %in% within mutate() and map(), but sense that I am incorrectly referencing the Sepal.Length vector within the tibble:

require(purrr)

nested %>%
  mutate(Sepal.Length.o = map(data, ~ ifelse(Sepal.Length[.x %in% nested$outliers], NA, Sepal.Length)))

I've reviewed list referencing and several posts on matching values from different sites, but nothing quite works with this dplyr format. Is there another or better way to do this?

After figuring out how to match, I was planning to use nested ifelse statements to conditionally replace outlier or extreme values with NA, perhaps similar to this:

nested %>%
  mutate(Sepal.Length.o = ifelse(out_num2 == T, 
                         ifelse(match_tibble_with_extremes, NA, Sepal.Length),
                         ifelse(match_tibble_with_outliers, NA, Sepal.Length)))

But I'm not sure if that will work as I intend. Any help, especially with the matching step, would be greatly appreciated.
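
The plan described above could also be sketched without nesting or value matching at all, by keeping the logical outlier and extreme masks and choosing between them per group. This is only a sketch under assumptions: na_outliers() is a hypothetical helper name, not something from dplyr or purrr, and ave() from base R stands in for the grouped step so the example has no package dependencies.

```r
# Hypothetical helper (name and argument are illustrative, not from the question).
# Takes one group's numeric values and returns them with outliers or extremes set to NA.
na_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  is_outlier <- x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr
  is_extreme <- x < q[1] - 3 * iqr | x > q[2] + 3 * iqr
  # The condition is a single TRUE/FALSE per group, so a plain if/else is enough;
  # nested ifelse() calls are not needed.
  drop <- if (sum(is_outlier, na.rm = TRUE) > 2) is_extreme else is_outlier
  replace(x, which(drop), NA)
}

# Apply per group with base R's ave(); the result lines up with the rows of iris.
sepal_length_o <- ave(iris$Sepal.Length, iris$Species, FUN = na_outliers)
```

With dplyr, the same helper should slot into a grouped mutate, for example iris %>% group_by(Species) %>% mutate(Sepal.Length.o = na_outliers(Sepal.Length)), and could then be reused for each response variable.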

Upvotes: 2

Views: 215

Answers (1)

akrun

Reputation: 887173

Here is one option:

library(dplyr)
library(purrr)

nested %>%
  rowwise() %>%
  # Per row, match against extremes when out_num2 is TRUE, otherwise against outliers
  mutate(data = map2(data,
                     if (out_num2) list(extremes) else list(outliers),
                     ~ replace(.x, .x %in% .y, NA)))
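
To see what the replace() call above does for a single group, here is a small base R sketch on made-up values (the numbers and variable names are illustrative, not taken from iris). It also shows one likely reason the attempt in the question matched nothing: there, the right-hand side of %in% was a list-column rather than a plain numeric vector.

```r
x <- c(4.9, 5.6, 6.3, 7.1)   # made-up values standing in for one group's Sepal.Length
out <- c(4.9)                # made-up outlier values for that group

# %in% gives a logical mask with one entry per element of x.
mask <- x %in% out

# replace() sets the masked positions to NA and leaves the rest untouched.
cleaned <- replace(x, mask, NA)

# When the right-hand side is a list holding a vector (as a list-column is),
# %in% compares against each list element as a whole, not the numbers inside it,
# so a multi-value element does not match any single value.
list_mask <- x %in% list(c(4.9, 5.6))
```

This is why the answer passes one group's vector at a time as .y, so that %in% compares numbers with numbers.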

Upvotes: 1
