Reputation: 239
I am trying to match a list of words with a list of sentences and form a data frame with the matching words and sentences. For example:
words <- c("far better","good","great","sombre","happy")
sentences <- c("This document is far better","This is a great app","The night skies were sombre and starless", "The app is too good and i am happy using it", "This is how it works")
The expected result (a dataframe) is as follows:
sentences                                      words
This document is far better                    better
This is a great app                            great
The night skies were sombre and starless       sombre
The app is too good and i am happy using it    good, happy
This is how it works                           -
I am using the following code to achieve this.
lengthOfData <- nrow(sentence_df)
pos.words <- polarity_table[polarity_table$y > 0]$x
neg.words <- polarity_table[polarity_table$y < 0]$x
positiveWordsList <- list()
negativeWordsList <- list()
for (i in 1:lengthOfData) {
  sentence <- sentence_df[i, ]$comment
  #sentence <- gsub('[[:punct:]]', "", sentence)
  #sentence <- gsub('[[:cntrl:]]', "", sentence)
  #sentence <- gsub('\\d+', "", sentence)
  sentence <- tolower(sentence)
  # get unigrams from the sentence
  unigrams <- unlist(strsplit(sentence, " ", fixed = TRUE))
  # get bigrams from the sentence; note the seq_len() -- the original
  # 1:length(unigrams)-1 parses as (1:length(unigrams)) - 1 and starts at 0
  bigrams <- unlist(lapply(seq_len(length(unigrams) - 1),
                           function(i) paste(unigrams[i], unigrams[i + 1])))
  # .. and combine
  words <- c(unigrams, bigrams)
  #if(sentence_df[i,]$ave_sentiment)
  pos.matches <- na.omit(match(words, pos.words))
  neg.matches <- na.omit(match(words, neg.words))
  positiveList <- pos.words[pos.matches]
  negativeList <- neg.words[neg.matches]
  if (length(positiveList) == 0) positiveList <- "-"
  if (length(negativeList) == 0) negativeList <- "-"
  positiveWordsList[[i]] <- paste(unique(positiveList), collapse = ", ")
  negativeWordsList[[i]] <- paste(unique(negativeList), collapse = ", ")
}
positiveWordsList <- unlist(positiveWordsList)
negativeWordsList <- unlist(negativeWordsList)
scores.df <- data.frame(ave_sentiment = sentence_df$ave_sentiment,
                        comment = sentence_df$comment,
                        pos = positiveWordsList,
                        neg = negativeWordsList,
                        year = sentence_df$year,
                        month = sentence_df$month,
                        stringsAsFactors = FALSE)
I have 28k sentences and 65k words to match against, and the above code takes 45 seconds to accomplish the task. Any suggestions on how to improve its performance?
Edit:
I want to get only those words that exactly match the words in the sentences. For example:
words <- c('sin','vice','crashes')
sentences <- c('Since the app crashes frequently, I advice you guys to fix the issue ASAP')
Now for the above case my output should be as follows:
sentences                                                                    words
Since the app crashes frequently, I advice you guys to fix the issue ASAP    crashes
Upvotes: 3
Views: 235
Reputation: 109874
You can do this in the latest version of sentimentr with extract_sentiment_terms,
but you'll have to make a sentiment key first and assign a value to each word:
pos <- c("far better","good","great","sombre","happy")
neg <- c('sin','vice','crashes')
sentences <- c('Since the app crashes frequently, I advice you guys to fix the issue ASAP',
"This document is far better", "This is a great app","The night skies were sombre and starless",
"The app is too good and i am happy using it", "This is how it works")
library(sentimentr)
(sentkey <- as_key(data.frame(c(pos, neg),
                              c(rep(1, length(pos)), rep(-1, length(neg))),
                              stringsAsFactors = FALSE)))
## x y
## 1: crashes -1
## 2: far better 1
## 3: good 1
## 4: great 1
## 5: happy 1
## 6: sin -1
## 7: sombre 1
## 8: vice -1
extract_sentiment_terms(sentences, sentkey)
## element_id sentence_id negative positive
## 1: 1 1 crashes
## 2: 2 1 far better
## 3: 3 1 great
## 4: 4 1 sombre
## 5: 5 1 good,happy
## 6: 6 1
Upvotes: 0
Reputation: 239
I was able to use @David Arenburg's answer with some modification. Here is what I did. I used the following (suggested by David) to form the data frame.
library(stringi)
df <- data.frame(sentences, stringsAsFactors = FALSE)
df$words <- sapply(sentences, function(x) toString(words[stri_detect_fixed(x, words)]))
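A self-contained sketch of this step, using the example data from the question (the stringi package is assumed to be installed):

```r
library(stringi)

words <- c("far better", "good", "great", "sombre", "happy")
sentences <- c("This document is far better",
               "This is a great app",
               "The app is too good and i am happy using it")

df <- data.frame(sentences, stringsAsFactors = FALSE)
# stri_detect_fixed(x, words) is vectorised over the patterns, so one call
# per sentence checks the whole dictionary at once
df$words <- sapply(sentences,
                   function(x) toString(words[stri_detect_fixed(x, words)]))
```

Note that this is still substring matching, which is why the exact-word filter described next is needed.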
The problem with the above approach is that it does not do exact word matching (e.g. "sin" matches inside "Since"). So I used the following to reshape the data frame to one row per matched word, so the non-exact matches can be filtered out.
# s is the per-sentence list of matched words (e.g. s <- strsplit(df$words, ", ")),
# so each sentence is repeated once per word it matched
df <- data.frame(fil = unlist(s), text = rep(df$sentences, sapply(s, FUN = length)))
After applying the above line the output data frame changes as follows.
sentences                                                                    words
This document is far better                                                  better
This is a great app                                                          great
The night skies were sombre and starless                                     sombre
The app is too good and i am happy using it                                  good
The app is too good and i am happy using it                                  happy
This is how it works                                                         -
Since the app crashes frequently, I advice you guys to fix the issue ASAP    crashes
Since the app crashes frequently, I advice you guys to fix the issue ASAP    vice
Since the app crashes frequently, I advice you guys to fix the issue ASAP    sin
Now apply the following filter to the data frame to remove the rows whose word does not occur as an exact token of the sentence.
df <- df[apply(df, 1, function(x) tolower(x[1]) %in% tolower(unlist(strsplit(x[2], split='\\s+')))),]
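Put together, the false-positive problem and the token filter can be sketched on the question's edit example (base grepl with fixed = TRUE stands in here for stri_detect_fixed):

```r
words <- c("sin", "vice", "crashes")
sentence <- "Since the app crashes frequently, I advice you guys to fix the issue ASAP"

# substring matching flags all three words: "sin" inside "Since",
# "vice" inside "advice", and the genuine hit "crashes"
df <- data.frame(fil  = words[sapply(words, grepl, x = sentence, fixed = TRUE)],
                 text = sentence,
                 stringsAsFactors = FALSE)

# keep a row only when the word is a whole whitespace-delimited token
df <- df[apply(df, 1, function(x)
  tolower(x[1]) %in% tolower(unlist(strsplit(x[2], split = "\\s+")))), ]
```

One caveat: a whitespace-token filter also drops legitimate multi-word entries such as "far better", and punctuation glued to a token (e.g. "good,") would block a match, so stripping punctuation first may be needed.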
Now my resulting data frame will be as follows.
sentences                                                                    words
This document is far better                                                  better
This is a great app                                                          great
The night skies were sombre and starless                                     sombre
The app is too good and i am happy using it                                  good
The app is too good and i am happy using it                                  happy
This is how it works                                                         -
Since the app crashes frequently, I advice you guys to fix the issue ASAP    crashes
stri_detect_fixed reduced my computation time a lot. The remaining processing did not take much time. Thanks to @David for pointing me in the right direction.
Upvotes: 1