Reputation: 566
I have a dataframe with an ID (V1) and some columns that contain strings with comma-separated values, i.e., "a,b" (V2). I want to filter the dataframe based on a match between any of a vector of values and any of the values in this string. I also want to preserve any other columns without naming them explicitly (V3).
df <- data.frame(V1 = 1:4, V2 = c("a,b", letters[1:3]), V3 = letters[20:23])
head(df)
V1 V2 V3
1 1 a,b t
2 2 a u
3 3 b v
4 4 c w
The output should contain any rows in which at least one part of V2 matches at least one of the values in a vector of values. I.e., if the criterion is c("a","b"), rows 1:3 should be preserved in full; if the criterion is c("a","d"), rows 1:2 should be preserved, etc. I would also like to preserve the complete string in V2.
I know I can use separate_rows, filter on the separated rows, then summarise (or aggregate) the dataframe again:
df %>%
separate_rows(V2, sep = ",") %>%
group_by_at(setdiff(names(df), "V2")) %>%
filter(V2 %in% c("a", "b")) %>%
summarise(paste(V2, collapse=","))
# A tibble: 3 x 3
# Groups: V1 [3]
V1 V3 `paste(V2, collapse = ",")`
<int> <chr> <chr>
1 1 t a,b
2 2 u a
3 3 v b
However, this has the disadvantage that the original string is not preserved. This makes a differences when not all parts of the string are contained in the critertion:
df %>%
separate_rows(V2, sep = ",") %>%
group_by_at(setdiff(names(df), "V2")) %>%
filter(V2 %in% c("a", "d")) %>%
summarise(paste(V2, collapse=","))
# A tibble: 2 x 3
# Groups: V1 [2]
V1 V3 `paste(V2, collapse = ",")`
<int> <chr> <chr>
1 1 t a
2 2 u a
I was wondering whether there is a solution that uses str_split() or a similar function directly inside filter(). Because this would return a TRUE/FALSE value for each part of the string, I tried to combine them using any(). This, however, does not work, because it is applied to any value returned by str_split() across the entire data frame:
# This does not include comma-separated values
df %>%
filter(
str_split(V2, ",") %in% c("a", "b")
)
V1 V2 V3
1 2 a u
2 3 b v
# This does not filter out non-matching values
df %>%
filter(
any(str_split(V2, ",") %in% c("a", "b"))
)
V1 V2 V3
1 1 a,b t
2 2 a u
3 3 b v
4 4 c w
Upvotes: 3
Views: 861
Reputation: 388962
I think instead of splitting the string you could think of creating a pattern to filter.
library(dplyr)
library(stringr)
get_rows <- function(data, vec) {
data %>%
filter(str_detect(V2, str_c(vec, collapse = '|')))
}
get_rows(df, c('a', 'b'))
# V1 V2 V3
#1 1 a,b t
#2 2 a u
#3 3 b v
get_rows(df, c('a', 'd'))
# V1 V2 V3
#1 1 a,b t
#2 2 a u
You can write this in base R as well :
get_rows <- function(data, vec) {
subset(data, grepl(paste0(vec, collapse = '|'), V2))
}
Upvotes: 2
Reputation: 18551
Maybe this approach can help with you original data (note that I updated your df
as mentioned in your comments).
The idea is to first use str_split
on V2
which will turn it into a list, then use map_lgl
to loop over the list with str_detect
wrapped in any
to return exactly one TRUE
or FALSE
per row. Within the str_detect
we are looking for exact matches of a
or b
by wrapping them in ^$
. And we can do this all inside filter
without adding a new variable.
library(tidyverse)
df <- data.frame(V1 = 1:4, V2 = c("a,b", "a", "b", "aba"), V3 = letters[20:23])
df %>%
filter(map_lgl(str_split(V2, ","),
~ any(str_detect(.x, "^a$|^b$"))))
#> V1 V2 V3
#> 1 1 a,b t
#> 2 2 a u
#> 3 3 b v
Created on 2020-10-03 by the reprex package (v0.3.0)
Upvotes: 3