Reputation: 239
Am trying to get only those words from a list that are present in a given sentence.The words can include bigram words as well. For example,
wordList <- c("really good","better","awesome","true","happy")
sentence <- c("This is a really good program but it can be made better by making it more efficient")
myresult should be :
"really good" "better"
I have 1000 sentences like this on which i need to compare the words. The word list is also bigger. i tried brute force method using grep command but it took a lot of time (as expected). I am looking to get the matching words in a way that the performance is better.
Upvotes: 0
Views: 139
Reputation: 2541
What about
unlist(sapply(wordList, function(x) grep(x, sentence)))
Upvotes: 0
Reputation: 239
I was able to use @epi99's answer with a slight modification.
wordList <- c("really good","better","awesome","true","happy")
sentence <- c("This is a really good program but it can be made better by making it more efficient")
# get unigrams from the sentence
unigrams <- unlist(strsplit(sentence, " ", fixed=TRUE))
# get bigrams from the sentence
bigrams <- unlist(lapply(1:length(unigrams)-1, function(i) {paste(unigrams[i],unigrams[i+1])} ))
# .. and combine into a single vector
grams=c(unigrams, bigrams)
# use match function to get the matching words
matches <- match(grams, wordList )
matches <- na.omit(matches)
matchingwords <- wordList[matches]
Upvotes: 0
Reputation: 4378
require(dplyr)
wordList <- c("really good","better","awesome","true","happy")
sentence <- c("This is a really good program but it can be made better by making it more efficient")
# get unigrams from the sentence
unigrams <- unlist(strsplit(sentence, " ", fixed=TRUE))
# get bigrams from the sentence
bigrams <- unlist(lapply(1:length(words)-1, function(i) {paste(words[i],words[i+1])} ))
# .. and combine into data frame
grams <- data.frame(grams=c(unigrams, bigrams), stringsAsFactors = FALSE)
# dplyr join should be pretty efficient
matches <- inner_join(data.frame(wordList, stringsAsFactors = FALSE),
grams,
by=c('wordList'='grams'))
matches
wordList
1 really good
2 better
Upvotes: 2