Reputation: 23
I have two columns of strings in a data frame that I want to compare. The first is a character vector and the second is a list whose elements are short character vectors. Imagine a data frame like this one:
V L
"Anameone" "name" "asd"
"Bnametwo" "dfg"
"Cnamethree" "hey" "C" "hi"
I would like to check whether any of the words in the first element of L appear in the first element of V, whether any of the words in the second element of L appear in the second element of V, and so on.
I managed to do what I wanted with a loop like this:
for (i in 1:3) {
  # Test every word in L[[i]] as a pattern against V[i], ignoring case
  df$matches[i] <- any(sapply(df$L[[i]], grepl, df$V[i], ignore.case = TRUE))
}
So that the output is:
> df$matches
[1]  TRUE FALSE  TRUE
But I actually have around 100,000 rows instead of 3, and the loop takes too long. I haven't been able to figure out how to do this more efficiently. Any ideas? All my other attempts without indices produced what would be a 3x3 matrix in this example, because they compare every element of V with every element of L, and I think that could be even worse than a for loop.
Upvotes: 2
Views: 1085
Reputation: 320
Something like this?
df <- data.frame(V = c('Anameone','Bnametwo','Cnamethree'),
L = I(list(c('name','asd'),c('dfg'),c('hey','C','hi'))))
sapply(1:nrow(df), function(x) any(sapply(df$L[[x]], function(y) grepl(y, df$V[x]))))
# [1] TRUE FALSE TRUE
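If the words in L are meant as literal substrings rather than regular expressions, and case should be ignored as in the question's loop, a variant along these lines may be worth trying. This is only a sketch under those assumptions: the names v_lower, l_lower and matches are made up for illustration, and lower-casing both sides stands in for ignore.case because grepl() cannot combine ignore.case with fixed = TRUE.

```r
df <- data.frame(V = c("Anameone", "Bnametwo", "Cnamethree"),
                 L = I(list(c("name", "asd"), "dfg", c("hey", "C", "hi"))))

# Lower-case both columns once, up front (assumption: case should be ignored)
v_lower <- tolower(as.character(df$V))
l_lower <- lapply(df$L, tolower)

# Walk V and L in parallel; each word is matched as a literal substring
matches <- mapply(
  function(v, words) any(vapply(words, grepl, logical(1), x = v, fixed = TRUE)),
  v_lower, l_lower, USE.NAMES = FALSE
)
matches
```

Whether this is actually faster on 100,000 rows would need to be measured on the real data.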
Upvotes: 1
Reputation: 3007
You can use purrr::map2_lgl() to iterate over both columns, testing if each element of l is in v with stringr::str_detect(), and then use any() to get just TRUE or FALSE if there are any matches.
library(dplyr)
library(purrr)
library(stringr)
df <- tibble(
v = c("Anameone", "Bnametwo", "Cnamethree"),
l = list(c("name", "asd"), "dfg", c("hey", "C", "hi"))
)
mutate(df, matches = map2_lgl(v, l, ~ str_detect(.x, .y) %>% any()))
#> # A tibble: 3 x 3
#> v l matches
#> <chr> <list> <lgl>
#> 1 Anameone <chr [2]> TRUE
#> 2 Bnametwo <chr [1]> FALSE
#> 3 Cnamethree <chr [3]> TRUE
Upvotes: 1
Reputation: 3986
sapply should work:
df<-data.frame(V=c("Anameone","Bnametwo","Cnamethree"),
L=I(list(c("name","asd"),"dfg",c("hey","C","hi"))))
# Collapse row i's words into one alternation pattern and test it against V[i]
sapply(seq_len(nrow(df)), function(i)
  grepl(paste(df$L[[i]], collapse = "|"), as.character(df$V[i])))
You'll have to check whether it's faster than the for loop. I couldn't recreate your example.
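One rough way to run that check is to time both versions on synthetic data with system.time(). The sketch below is only an illustration: the data is random lowercase letters, n is an arbitrary size, and loop_version and sapply_version are hypothetical helper names, so the timings will not necessarily carry over to the real 100,000-row data frame.

```r
set.seed(1)
n <- 1000  # arbitrary size for illustration; increase to approach the real data

# Synthetic data: 10-letter strings in V, one to three 2-letter words per row in L
V <- replicate(n, paste(sample(letters, 10, replace = TRUE), collapse = ""))
L <- replicate(n,
               replicate(sample(1:3, 1),
                         paste(sample(letters, 2, replace = TRUE), collapse = "")),
               simplify = FALSE)
df <- data.frame(V = V, L = I(L), stringsAsFactors = FALSE)

# Row-by-row loop, one grepl() call per word
loop_version <- function(df) {
  out <- logical(nrow(df))
  for (i in seq_len(nrow(df))) {
    out[i] <- any(sapply(df$L[[i]], grepl, df$V[i]))
  }
  out
}

# One grepl() call per row, using a collapsed "a|b|c" pattern
sapply_version <- function(df) {
  sapply(seq_len(nrow(df)), function(i)
    grepl(paste(df$L[[i]], collapse = "|"), df$V[i]))
}

system.time(res_loop <- loop_version(df))
system.time(res_sapply <- sapply_version(df))

# Both approaches should agree before their timings are compared
identical(res_loop, res_sapply)
```

The collapsed-pattern version treats the words as regular expressions, so it is only equivalent to the loop when the words contain no regex metacharacters, as is the case for this synthetic data.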
Upvotes: 0