Yehuda

Reputation: 1893

which() function in filter() with dplyr

I am trying to filter a data set and then set the outliers to the mean. Sample data frame:

structure(list(INDEX = c(1, 2, 3, 4, 5, 6), TARGET_WINS = c(39, 
70, 86, 70, 82, 75), TEAM_BATTING_H = c(1445, 1339, 1377, 1387, 
1297, 1279), TEAM_BATTING_2B = c(194, 219, 232, 209, 186, 200
), TEAM_BATTING_3B = c(39, 22, 35, 38, 27, 36), TEAM_BATTING_HR = c(13, 
190, 137, 96, 102, 92), TEAM_BATTING_BB = c(143, 685, 602, 451, 
472, 443), TEAM_BATTING_SO = c(842, 1075, 917, 922, 920, 973), 
    TEAM_BASERUN_SB = c(NA, 37, 46, 43, 49, 107), TEAM_BASERUN_CS = c(NA, 
    28, 27, 30, 39, 59), TEAM_BATTING_HBP = c(NA_real_, NA_real_, 
    NA_real_, NA_real_, NA_real_, NA_real_), TEAM_PITCHING_H = c(9364, 
    1347, 1377, 1396, 1297, 1279), TEAM_PITCHING_HR = c(84, 191, 
    137, 97, 102, 92), TEAM_PITCHING_BB = c(927, 689, 602, 454, 
    472, 443), TEAM_PITCHING_SO = c(5456, 1082, 917, 928, 920, 
    973), TEAM_FIELDING_E = c(1011, 193, 175, 164, 138, 123), 
    TEAM_FIELDING_DP = c(NA, 155, 153, 156, 168, 149)), row.names = c(NA, 
-6L), class = c("tbl_df", "tbl", "data.frame"))

Using dplyr, I filter the outliers, then attempt to mutate the TEAM_FIELDING_E column based on the corrected (non-outlier) mean:

train %>% 
  filter(which(boxplot.stats(train$TEAM_FIELDING_E)$out %in% train$TEAM_FIELDING_E, arr.ind = TRUE) == TRUE) %>% 
  mutate(
    TEAM_FIELDING_E = NA,
    TEAM_FIELDING_E = mean(train$TEAM_FIELDING_E)
  )

This returns the error: Error in filter_impl(.data, quo) : Result must have length 2276, not 303 (the original data set has 2276 rows, 303 of which are TEAM_FIELDING_E outliers). How do I use filter() so that my mutate() only affects the filtered rows?

Upvotes: 0

Views: 1819

Answers (1)

Jake Kaupp

Reputation: 8072

Within dplyr verbs, use bare variable names rather than [[ or $. Additionally, if you're trying to filter on a value, you can filter on the value directly rather than using which() to find the position of the match.
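As a rough illustration of the point above, here is a sketch that uses only the TEAM_FIELDING_E values from the question's sample data (the names sample_df, outlier_rows and bad_condition are made up for this example). filter() expects a logical vector with one entry per row, and a condition written with the bare column name gives exactly that. The which(...) == TRUE expression from the question has one entry per outlier instead, which is why dplyr complained about the length.

```r
library(dplyr)

# TEAM_FIELDING_E values copied from the question's sample data
sample_df <- data.frame(TEAM_FIELDING_E = c(1011, 193, 175, 164, 138, 123))

# Values flagged by the boxplot rule, computed once outside the pipe
out <- boxplot.stats(sample_df$TEAM_FIELDING_E)$out

# The condition is a logical vector with one entry per row
outlier_rows <- sample_df %>%
  filter(TEAM_FIELDING_E %in% out)

# The question's expression has one entry per outlier instead
bad_condition <- which(out %in% sample_df$TEAM_FIELDING_E, arr.ind = TRUE) == TRUE
length(bad_condition)  # equals length(out), not nrow(sample_df)
```

Note that filter() keeps only the matching rows, so this is useful for inspecting the outliers rather than for replacing them in place.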

For this case, you can get what you want with an if_else within mutate.

out <- boxplot.stats(train$TEAM_FIELDING_E)$out

train %>% 
  mutate(TEAM_FIELDING_E = if_else(
    TEAM_FIELDING_E %in% out,                            # rows holding an outlier
    mean(TEAM_FIELDING_E[!(TEAM_FIELDING_E %in% out)]),  # mean of the non-outliers
    TEAM_FIELDING_E                                      # all other rows unchanged
  ))
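For a quick check, here is the same approach run on just the TEAM_FIELDING_E values from the question's sample data. This is only a sketch: sample_df, is_out and fixed are names invented for this example, and the helper column is_out is there purely to avoid writing the %in% test twice.

```r
library(dplyr)

# TEAM_FIELDING_E values copied from the question's sample data
sample_df <- data.frame(TEAM_FIELDING_E = c(1011, 193, 175, 164, 138, 123))
out <- boxplot.stats(sample_df$TEAM_FIELDING_E)$out

fixed <- sample_df %>%
  mutate(
    is_out = TEAM_FIELDING_E %in% out,  # TRUE for flagged rows
    TEAM_FIELDING_E = if_else(is_out, mean(TEAM_FIELDING_E[!is_out]), TEAM_FIELDING_E)
  ) %>%
  select(-is_out)  # drop the helper column
```

With these six values, 1011 is the only one flagged, so it is replaced by the mean of the other five (158.6). Unlike filter(), all six rows are kept.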

Upvotes: 2
