Venu

Reputation: 239

Filter a data frame based on matching values (exact) between two columns

I have a dataframe with two columns: one contains sentences, the other contains single words. For example:

words   sentences
loose   Loose connection several times a day on my tablet.  
loud    People don't speak loud or clear enough to hear voicemails
vice    I strongly advice you to fix this issue
advice  I strongly advice you to fix this issue

Now I want to filter this data frame so that I keep only the rows whose word appears as an exact, whole-word match in the sentence:

words   sentences
loose   Loose connection several times a day on my tablet.  
loud    People don't speak loud or clear enough to hear voicemails
advice  I strongly advice you to fix this issue   

The word 'vice' did not match exactly (it only appears inside 'advice'), so its row has to be removed. I have nearly 20k rows in the dataframe. Can someone suggest a method to accomplish this task without losing much performance?

Upvotes: 1

Views: 955

Answers (3)

Oli

Reputation: 542

The simplest solution uses the stringr package:

library(stringr)

df <- data.frame(words = c("went", "zero", "vice"),
                 sent  = c("a man went to the park", "one minus one is 0", "any advice?"))

# Pad both columns with spaces so only whole-word matches succeed
df$words <- paste0(" ", df$words, " ")
df$sent  <- paste0(" ", df$sent, " ")

df$match <- str_detect(df$sent, df$words)

df.res <- df[df$match, ]
df.res$match <- NULL
df.res
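Note that str_detect interprets its pattern argument as a regular expression by default. If any of the words could contain regex metacharacters (such as + or ?), wrapping the pattern in stringr's fixed() forces a literal comparison. A sketch along the same lines as above, with a made-up "a+b" word to illustrate:

```r
library(stringr)

# Hypothetical data containing a word with a regex metacharacter
df <- data.frame(words = c("went", "zero", "a+b"),
                 sent  = c("a man went to the park", "one minus one is 0", "the a+b case"))

# Same space-padding trick as above for whole-word matching
df$words <- paste0(" ", df$words, " ")
df$sent  <- paste0(" ", df$sent, " ")

# fixed() makes str_detect treat each word as a literal string,
# so '+' is not interpreted as a regex quantifier
df$match <- str_detect(df$sent, fixed(df$words))
df[df$match, c("words", "sent")]
```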

Upvotes: 1

Jaap

Reputation: 83215

Using:

library(stringi)
df[stri_detect_regex(tolower(df$sentences), paste0('\\b',df$words,'\\b')),]

you get:

   words                                                  sentences
1  loose         Loose connection several times a day on my tablet.
2   loud People don't speak loud or clear enough to hear voicemails
4 advice                    I strongly advice you to fix this issue

Explanation:

  • Convert the capitals in the sentences to lowercase with tolower.
  • Create a regex vector with paste0 by wrapping the words in word boundaries (\\b).
  • Use stri_detect_regex from the stringi package to see in which rows there is a match, resulting in a logical vector of TRUE and FALSE values.
  • Subset the data frame with that logical vector.

As an alternative, you can also use str_detect from the stringr package (which is actually a wrapper around the stringi package):

library(stringr)
df[str_detect(tolower(df$sentences), paste0('\\b',df$words,'\\b')),]

Used data:

df <- structure(list(words = c("loose", "loud", "vice", "advice"), 
                     sentences = c("Loose connection several times a day on my tablet.", 
                                   "People don't speak loud or clear enough to hear voicemails", 
                                   "I strongly advice you to fix this issue", "I strongly advice you to fix this issue")), 
                .Names = c("words", "sentences"), class = "data.frame", row.names = c(NA, -4L))
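For comparison, the same word-boundary filter can be written in base R without extra packages. grepl() takes a single pattern, so mapply() is one way to pair each row's pattern with its own sentence; a sketch using the same data:

```r
df <- data.frame(words = c("loose", "loud", "vice", "advice"),
                 sentences = c("Loose connection several times a day on my tablet.",
                               "People don't speak loud or clear enough to hear voicemails",
                               "I strongly advice you to fix this issue",
                               "I strongly advice you to fix this issue"))

# One regex per row: the word wrapped in word boundaries
patterns <- paste0("\\b", df$words, "\\b")

# mapply() pairs pattern i with sentence i
keep <- mapply(grepl, patterns, tolower(df$sentences), USE.NAMES = FALSE)
df[keep, ]
```

This does one grepl() call per row, so the vectorized stringi version above is likely faster on 20k rows, but the base-R form avoids the package dependency.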

Upvotes: 3

Sandipan Dey

Reputation: 23101

You can try something like the following:

df <- df[apply(df, 1, function(x) tolower(x[1]) %in% tolower(unlist(strsplit(x[2], split='\\s+')))),]

df
   words                                                  sentences
1  loose         Loose connection several times a day on my tablet.
2   loud People don't speak loud or clear enough to hear voicemails
4 advice                    I strongly advice you to fix this issue
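One caveat: splitting on whitespace ('\\s+') leaves punctuation attached to the tokens, so a word such as "tablet" would not match the token "tablet.". Splitting on runs of non-word characters instead handles that. A sketch on hypothetical data:

```r
# Hypothetical rows where punctuation sits next to the target word
df2 <- data.frame(words = c("tablet", "vice"),
                  sentences = c("Loose connection on my tablet.", "any advice?"))

# '\\W+' splits on any non-word characters, stripping punctuation,
# so "tablet" matches even though the sentence ends with "tablet."
keep <- apply(df2, 1, function(x)
  tolower(x[1]) %in% tolower(unlist(strsplit(x[2], split = "\\W+"))))
df2[keep, ]
```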

Upvotes: 2
