Reputation: 239
I have a dataframe that has two columns. One column contains list of sentences, other column contains words. For example:
words sentences
loose Loose connection several times a day on my tablet.
loud People don't speak loud or clear enough to hear voicemails
vice I strongly advice you to fix this issue
advice I strongly advice you to fix this issue
Now I want to filter this data frame such that I get only those rows that has words that exactly match with the words in the sentence:
words sentences
loose Loose connection several times a day on my tablet.
loud People don't speak loud or clear enough to hear voicemails
advice I strongly advice you to fix this issue
The word 'vice' did not match exactly so it has to be removed. I have nearly 20k rows in the dataframe. Can someone suggest me which method to use to accompolish this task so that I don't lose much performance.
Upvotes: 1
Views: 955
Reputation: 542
Most simple solution is using the stringr package:
df<- data.frame(words=c("went","zero", "vice"), sent=c("a man went to the park","one minus one is 0","any advice?"))
df$words <- paste0(" ",df$words," ")
df$sent <- paste0(" ",df$sent," ")
df$match <- str_detect(df$sent,df$words)
df.res <- df[df$match > 0,]
df.res$match<-NULL
df.res
Upvotes: 1
Reputation: 83215
Using:
library(stringi)
df[stri_detect_regex(tolower(df$sentences), paste0('\\b',df$words,'\\b')),]
you get:
words sentences
1 loose Loose connection several times a day on my tablet.
2 loud People don't speak loud or clear enough to hear voicemails
4 advice I strongly advice you to fix this issue
Explanation:
tolower
.paste0
by wrapping the words in words
in wordboundaries (\\b
).stri_detect_regex
from the stringi-package to see in wich rows there is no match, resulting in a logicalvector with TRUE
& FALSE
values.As an alternative, you can also use str_detect
from the stringr
package (which is actually a wrapper around the stringi
package):
library(stringr)
df[str_detect(tolower(df$sentences), paste0('\\b',df$words,'\\b')),]
Used data:
df <- structure(list(words = c("loose", "loud", "vice", "advice"),
sentences = c("Loose connection several times a day on my tablet.",
"People don't speak loud or clear enough to hear voicemails",
"I strongly advice you to fix this issue", "I strongly advice you to fix this issue")),
.Names = c("words", "sentences"), class = "data.frame", row.names = c(NA, -4L))
Upvotes: 3
Reputation: 23101
You can try something like the following:
df[apply(df, 1, function(x) tolower(x[1]) %in% tolower(unlist(strsplit(x[2], split='\\s+')))),]
df
words sentences
1 loose Loose connection several times a day on my tablet.
2 loud People dont speak loud or clear enough to hear voicemail
4 advice advice I strongly advice you to fix this issue
Upvotes: 2