How to find an exact set of strings in a column of varied strings in R dataframe?

Question

I want to find rows in an R data frame whose string column contains every word from a given set of strings, matched as whole words.

Here's the format in which I have my bunch of reference strings which will be stored in the variable splitval:

library(gsubfn)
#Splitting each rule into its individual parameter elements
str <- strsplit(gsub("\\,\\+"," +", gsub("=>","",  gsubfn(".", list("{" = "", "}" = ""), gsub("corpsi", "+corpsi", "{dog} => {pet}")))), split='+', fixed=TRUE)
parameters <- data.frame(do.call(rbind, str)) #Creating a df of the split parameters
parameters <- data.frame(t(parameters))
parameters <- parameters[parameters$t.parameters.!="",]
parameters <- trimws(parameters, "r")

#Applying filter on all the parameters of a single rule row
splitval = strsplit(parameters[1],split=' ', fixed=TRUE)
splitval = lapply(list(splitval[[1]]), function(z){ z[z != ""]}) #Eliminating the "" instances

So now, splitval has the following value:

[[1]]
[1] "dog" "pet"

Now my objective is to extract only those rows of the following dataframe where the string column's entries contain both of the exact words dog and pet.

Note: It should not match strings containing phrases like doganimal pets or dogsareanimals and petssss, where dog and pet only appear as parts of longer words.

This is my dataframe:

df <- data.frame(Srno = 1:5, Description = c("dog is my pet", "doganimal pets country", "my pet is my dog", "dogsareanimals and petssss", "a pet dog is great"))

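Which looks like this:

  Srno                Description
1    1              dog is my pet
2    2     doganimal pets country
3    3           my pet is my dog
4    4 dogsareanimals and petssss
5    5         a pet dog is great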

Hence, I need only rows 1, 3 and 5 in my extract, since only these contain both "dog" and "pet" as whole words (in no specific order).

But when I use the following code, I get all the rows of the dataframe, since every string contains the two reference keywords at least as substrings. grep is not serving the intended purpose.

extract_df <- df[(grep(splitval[[1]][1], df$Description)),]
  for(k in 2:length(splitval[[1]]))
  {
    extract_df  <- extract_df[(grep(splitval[[1]][k], df$Description)),]
  }

So can anyone help me to get only rows 1,3 & 5 in the output extracted dataframe?

Ronak Shah · Accepted Answer

Assuming that splitval can hold any number of words, not always exactly two, you can split each string into its individual words and select the rows that contain all the words in vec.

In base R you can do this as:

vec <- splitval[[1]]
#For this case
#vec <- c("dog", "pet")

subset(df, sapply(strsplit(df$Description, '\\s+'), function(x) all(vec %in% x)))

#  Srno        Description
#1    1      dog is my pet
#3    3   my pet is my dog
#5    5 a pet dog is great
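
To see why this works, it can help to look at the intermediate values. The following is a small sketch that rebuilds df and vec from the question so it runs on its own; the names tokens and keep are only for illustration, and stringsAsFactors = FALSE is set on the assumption that you may be on an R version older than 4.0, where the column would otherwise be a factor.

```r
df <- data.frame(
  Srno = 1:5,
  Description = c("dog is my pet", "doganimal pets country", "my pet is my dog",
                  "dogsareanimals and petssss", "a pet dog is great"),
  stringsAsFactors = FALSE
)
vec <- c("dog", "pet")

# Split every description into whole words on runs of whitespace
tokens <- strsplit(df$Description, "\\s+")

# A row is kept only if every reference word is one of its tokens
keep <- sapply(tokens, function(x) all(vec %in% x))

tokens[[2]]  # "doganimal" "pets" "country": neither "dog" nor "pet" is a token
keep         # TRUE FALSE TRUE FALSE TRUE
df[keep, ]   # rows 1, 3 and 5
```

Because the comparison is done on whole tokens with %in%, "doganimal" or "petssss" can never count as "dog" or "pet", which is the difference from the substring matching that grep does.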

Using tidyverse:

library(tidyverse)
df %>% filter(map_lgl(str_split(df$Description, '\\s+'), ~all(vec %in% .x)))
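
If you would rather stay close to the grep-style code in the question, a possible alternative (not part of the approach above, just a sketch) is to wrap each word in \\b word boundaries and require every pattern to match. The names patterns, hits and keep are illustrative. This assumes the words in vec contain no regex metacharacters, and it behaves slightly differently from splitting on whitespace: "dog," or "dog-house" would also count as containing the word dog.

```r
df <- data.frame(
  Srno = 1:5,
  Description = c("dog is my pet", "doganimal pets country", "my pet is my dog",
                  "dogsareanimals and petssss", "a pet dog is great"),
  stringsAsFactors = FALSE
)
vec <- c("dog", "pet")

# Build one whole-word pattern per reference word, e.g. "\\bdog\\b"
patterns <- paste0("\\b", vec, "\\b")

# Test every row against each pattern (one logical vector per pattern)
hits <- lapply(patterns, grepl, x = df$Description, perl = TRUE)

# Keep a row only if all patterns matched it
keep <- Reduce(`&`, hits)

df[keep, ]
```

On the sample data this selects the same rows 1, 3 and 5, since "doganimal", "pets", "dogsareanimals" and "petssss" have no word boundary directly after "dog" or "pet".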
