How to treat factors as strings and filter data frames with them

Question

I can't find an easy way to filter a data.frame on a factor column. I thought I could use str_detect to treat the factor as strings. I want to keep only the rows where df$kind is not flow-delivery, storage, or flow-channel. I could add a column with mutate(kind2 = as.character(kind)) and filter on that, but I'd rather avoid the redundancy, and I'm sure I'm missing something obvious.

 library(dplyr)
 library(stringr)  # str_detect()
 library(ggplot2)  # ggplot(), aes(), geom_line()
 plot_monoth_ts <- function(df, yearmon, rawval, rawunit, dv, study, yrmin, yrmax) 
 {df %>% filter(str_detect(!kind, 'flow-delivery|storage|flow-channel')) %>%
 ggplot(aes(x = yearmon, y = rawval, color = study, linetype = dv))+geom_line()}

which returns this warning:

  Warning message:
  In Ops.factor(kind) : ‘!’ not meaningful for factors

Any tips greatly appreciated.

thank you, Dave

DuckPyjamas · Accepted Answer

You're over-thinking it! :> No character conversion is necessary. As long as the factor has a label associated with each of its levels, you can refer to the levels as if they were strings.

library(dplyr)

iris %>% sample_n(5)

# Note that 'Species' is a factor with 3 levels.

# Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
# 56           5.7         2.8          4.5         1.3 versicolor
# 44           5.0         3.5          1.6         0.6     setosa
# 104          6.3         2.9          5.6         1.8  virginica
# 123          7.7         2.8          6.7         2.0  virginica
# 149          6.2         3.4          5.4         2.3  virginica

omitted <- c("versicolor", "setosa")
filter(iris, !(Species %in% omitted)) %>% sample_n(5)

# Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
# 22          5.6         2.8          4.9         2.0 virginica
# 34          6.3         2.8          5.1         1.5 virginica
# 41          6.7         3.1          5.6         2.4 virginica
# 17          6.5         3.0          5.5         1.8 virginica
# 19          7.7         2.6          6.9         2.3 virginica

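Note the !(x %in% y) construct: %in% returns a logical vector, and the negation is applied to that result rather than to the factor itself, so the Ops.factor warning from the question does not arise.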
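
Applied to the question's data, the same construct might look like the sketch below. The data frame `df` and its values are made up for illustration; only the column name `kind` and the three excluded levels come from the question.

```r
library(dplyr)

# Hypothetical stand-in for the asker's data; only `kind` and the
# three excluded level names are taken from the question.
df <- data.frame(
  kind   = factor(c("flow-delivery", "storage", "flow-channel",
                    "stage", "diversion")),
  rawval = c(10, 20, 30, 40, 50)
)

excluded <- c("flow-delivery", "storage", "flow-channel")

# %in% matches the factor's labels against the character vector and
# returns a logical vector, which `!` can negate.
kept <- df %>% filter(!(kind %in% excluded))

# The excluded levels stay attached to the factor after filtering;
# droplevels() removes them if they would clutter a plot legend.
kept <- droplevels(kept)
levels(kept$kind)
```

The `droplevels()` step is optional and only matters if the unused levels get in the way later, for example in a legend or a summary table.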

Quick comparison of speed:

library(microbenchmark)
library(stringr)  # provides str_detect(), used in the second benchmark

microbenchmark(filter(iris, !(Species %in% c("versicolor", "setosa"))))

# Unit: microseconds
# min     lq       mean     median   uq       max
# 568.189 575.8505 600.3869 580.8085 603.3435 870.7620

microbenchmark(filter(iris, !str_detect(as.character(Species), "versicolor|setosa")))

# Unit: microseconds
# min     lq       mean     median   uq      max
# 620.169 633.6910 671.0874 656.8275 687.325 928.1510

As expected, converting to character and then using regex pattern-matching is slower, even on a small dataset like iris: the median is about 657 microseconds versus about 581 microseconds, roughly 13% slower.
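
If a pattern match is still preferred, the negation has to wrap the result of the match rather than the factor, which is what triggered the warning in the question. A minimal sketch, using base R's grepl() (which coerces a factor to character itself) and a made-up `kind` factor whose level names are borrowed from the question:

```r
# Hypothetical factor; "stage" is an invented level for illustration.
kind <- factor(c("flow-delivery", "storage", "flow-channel", "stage"))

# `!kind` would try to negate the factor itself and give the
# Ops.factor warning. Negating the logical result of the match works:
keep <- !grepl("flow-delivery|storage|flow-channel", kind)

kind[keep]
```

Inside a dplyr pipeline the same condition would read filter(df, !grepl("flow-delivery|storage|flow-channel", kind)), assuming `df` has a `kind` column as in the question.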
