Reputation: 315
I'm trying to match an exact set of words against a column of strings in an R data frame.
Here's how I build my set of reference strings, which ends up stored in the variable splitval:
library(gsubfn)
#Splitting each rule into its individual parameter elements
str <- strsplit(gsub("\\,\\+"," +", gsub("=>","", gsubfn(".", list("{" = "", "}" = ""), gsub("corpsi", "+corpsi", "{dog} => {pet}")))), split='+', fixed=TRUE)
parameters <- data.frame(do.call(rbind, str)) #Creating a df of the split parameters
parameters <- data.frame(t(parameters))
parameters <- parameters[parameters$t.parameters.!="",]
parameters <- trimws(parameters, "r")
#Applying filter on all the parameters of a single rule row
splitval = strsplit(parameters[1],split=' ', fixed=TRUE)
splitval = lapply(list(splitval[[1]]), function(z){ z[z != ""]}) #Eliminating the "" instances
So now, splitval has the following value:
[[1]]
[1] "dog" "pet"
Now my objective is to extract only those rows of the following dataframe where the string column contains both of the exact words dog and pet.
Note: It should not match strings that merely contain them as substrings, like doganimal pets or dogsareanimals and petssss
This is my dataframe:
df <- data.frame(Srno = 1:5, Description = c("dog is my pet", "doganimal pets country", "my pet is my dog", "dogsareanimals and petssss", "a pet dog is great"))
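Which looks like this:
  Srno                Description
1    1              dog is my pet
2    2     doganimal pets country
3    3           my pet is my dog
4    4 dogsareanimals and petssss
5    5         a pet dog is great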
Hence, I need only rows 1, 3 & 5 in my extract, since only these contain both "dog" and "pet" as whole words (in no specific order).
But when I use the following code, I get all the rows of the dataframe, because grep matches substrings and every string contains both keywords somewhere. So grep is not serving the intended purpose.
extract_df <- df[(grep(splitval[[1]][1], df$Description)),]
for(k in 2:length(splitval[[1]]))
{
extract_df <- extract_df[(grep(splitval[[1]][k], df$Description)),]
}
So can anyone help me to get only rows 1,3 & 5 in the output extracted dataframe?
Upvotes: 0
Views: 117
Reputation: 389265
Assuming that splitval can hold any number of words, not always exactly two, you can split each string into its words and select the rows that contain all the words in vec.
In base R you can do this as:
vec <- splitval[[1]]
#For this case
#vec <- c("dog", "pet")
subset(df, sapply(strsplit(df$Description, '\\s+'), function(x) all(vec %in% x)))
#  Srno        Description
#1    1      dog is my pet
#3    3   my pet is my dog
#5    5 a pet dog is great
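To see why this works, here is a small breakdown of the same steps on the example data. It is only a sketch: the names words and keep are illustrative, and stringsAsFactors = FALSE is set explicitly on the assumption that you may be on an R version older than 4.0, where strsplit would otherwise fail on a factor column.

```r
df <- data.frame(Srno = 1:5,
                 Description = c("dog is my pet", "doganimal pets country",
                                 "my pet is my dog", "dogsareanimals and petssss",
                                 "a pet dog is great"),
                 stringsAsFactors = FALSE)
vec <- c("dog", "pet")

# Split each description on whitespace: one character vector of words per row
words <- strsplit(df$Description, "\\s+")
words[[2]]
# [1] "doganimal" "pets"      "country"

# %in% compares whole elements, so "doganimal" does not count as "dog"
vec %in% words[[2]]
# [1] FALSE FALSE

# Keep a row only if every keyword appears among its words
keep <- vapply(words, function(x) all(vec %in% x), logical(1))
keep
# [1]  TRUE FALSE  TRUE FALSE  TRUE
```

Because the comparison is done on whole words rather than on substrings, rows 2 and 4 drop out, which grep alone could not achieve.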
Using tidyverse:
library(tidyverse)
df %>% filter(map_lgl(str_split(Description, '\\s+'), ~all(vec %in% .x)))
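If you would rather stay with grep-style matching, a different technique from the splitting approach above is to wrap each keyword in word boundaries (\\b). This is only a sketch and it assumes the keywords contain no regex metacharacters; patterns, hits and keep_rows are illustrative names.

```r
df <- data.frame(Srno = 1:5,
                 Description = c("dog is my pet", "doganimal pets country",
                                 "my pet is my dog", "dogsareanimals and petssss",
                                 "a pet dog is great"),
                 stringsAsFactors = FALSE)
vec <- c("dog", "pet")

# Wrap each keyword in \b ... \b so it only matches as a whole word
patterns <- paste0("\\b", vec, "\\b")

# One logical vector per keyword, then AND them together across keywords
hits <- lapply(patterns, grepl, x = df$Description)
keep_rows <- Reduce(`&`, hits)

df[keep_rows, ]
```

Note that the two approaches differ on punctuation: \\b would treat "dog," as containing the word dog, whereas splitting on whitespace compares against the token "dog," and would not match.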
Upvotes: 1