simoncolumbus
simoncolumbus

Reputation: 566

str_split() within filter()

I have a dataframe with an ID (V1) and some columns that contain strings with comma-separated values, i.e., "a,b" (V2). I want to filter the dataframe based on a match between any of a vector of values and any of the values in this string. I also want to preserve any other columns without naming them explicitly (V3).

df <- data.frame(V1 = 1:4, V2 = c("a,b", letters[1:3]), V3 = letters[20:23])
head(df)
  V1  V2 V3
1  1 a,b  t
2  2   a  u
3  3   b  v
4  4   c  w

The output should contain any rows in which at least one part of V2 matches at least one of the values in a vector of values. I.e., if the criterion is c("a","b"), rows 1:3 should be preserved in full; if the criterion is c("a","d"), rows 1:2 should be preserved, etc. I would also like to preserve the complete string in V2.

I know I can use separate_rows, filter on the separated rows, then summarise (or aggregate) the dataframe again:

df %>% 
  separate_rows(V2, sep = ",") %>%
  group_by_at(setdiff(names(df), "V2")) %>%
  filter(V2 %in% c("a", "b")) %>%
  summarise(paste(V2, collapse=","))

# A tibble: 3 x 3
# Groups:   V1 [3]
     V1 V3    `paste(V2, collapse = ",")`
  <int> <chr> <chr>                      
1     1 t     a,b                        
2     2 u     a                          
3     3 v     b   

However, this has the disadvantage that the original string is not preserved. This makes a differences when not all parts of the string are contained in the critertion:

df %>% 
  separate_rows(V2, sep = ",") %>%
  group_by_at(setdiff(names(df), "V2")) %>%
  filter(V2 %in% c("a", "d")) %>%
  summarise(paste(V2, collapse=","))

# A tibble: 2 x 3
# Groups:   V1 [2]
     V1 V3    `paste(V2, collapse = ",")`
  <int> <chr> <chr>                      
1     1 t     a                          
2     2 u     a  

I was wondering whether there is a solution that uses str_split() or a similar function directly inside filter(). Because this would return a TRUE/FALSE value for each part of the string, I tried to combine them using any(). This, however, does not work, because it is applied to any value returned by str_split() across the entire data frame:

# This does not include comma-separated values
df %>%
  filter(
    str_split(V2, ",") %in% c("a", "b")
  )
  V1 V2 V3
1  2  a  u
2  3  b  v

# This does not filter out non-matching values
df %>%
  filter(
    any(str_split(V2, ",") %in% c("a", "b"))
  )
  V1  V2 V3
1  1 a,b  t
2  2   a  u
3  3   b  v
4  4   c  w

Upvotes: 3

Views: 861

Answers (2)

Ronak Shah
Ronak Shah

Reputation: 388962

I think instead of splitting the string you could think of creating a pattern to filter.

library(dplyr)
library(stringr)

get_rows <- function(data, vec) {
  data %>%
    filter(str_detect(V2, str_c(vec, collapse = '|')))
}

get_rows(df, c('a', 'b'))
#  V1  V2 V3
#1  1 a,b  t
#2  2   a  u
#3  3   b  v

get_rows(df, c('a', 'd'))
#  V1  V2 V3
#1  1 a,b  t
#2  2   a  u

You can write this in base R as well :

get_rows <- function(data, vec) {
  subset(data, grepl(paste0(vec, collapse = '|'), V2))
}

Upvotes: 2

TimTeaFan
TimTeaFan

Reputation: 18551

Maybe this approach can help with you original data (note that I updated your df as mentioned in your comments).

The idea is to first use str_split on V2 which will turn it into a list, then use map_lgl to loop over the list with str_detect wrapped in any to return exactly one TRUE or FALSE per row. Within the str_detect we are looking for exact matches of a or b by wrapping them in ^$. And we can do this all inside filter without adding a new variable.

library(tidyverse)

df <- data.frame(V1 = 1:4, V2 = c("a,b", "a", "b", "aba"), V3 = letters[20:23])

df %>% 
  filter(map_lgl(str_split(V2, ","),
                 ~ any(str_detect(.x, "^a$|^b$"))))
#>   V1  V2 V3
#> 1  1 a,b  t
#> 2  2   a  u
#> 3  3   b  v

Created on 2020-10-03 by the reprex package (v0.3.0)

Upvotes: 3

Related Questions