Reputation: 239
I am trying to match a list of words with a list of sentences and form a data frame with the matching words and sentences. For example:
words <- c("far better","good","great","sombre","happy")
sentences <- c("This document is far better","This is a great app","The night skies were sombre and starless", "The app is too good and i am happy using it", "This is how it works")
The expected result (a dataframe) is as follows:
sentences                                      words
This document is far better                    better
This is a great app                            great
The night skies were sombre and starless       sombre
The app is too good and i am happy using it    good, happy
This is how it works                           -
I am using the following code to achieve this.
lengthOfData <- nrow(sentence_df)
pos.words <- polarity_table[polarity_table$y > 0]$x
neg.words <- polarity_table[polarity_table$y < 0]$x
positiveWordsList <- list()
negativeWordsList <- list()
for (i in 1:lengthOfData) {
  sentence <- sentence_df[i, ]$comment
  #sentence <- gsub('[[:punct:]]', "", sentence)
  #sentence <- gsub('[[:cntrl:]]', "", sentence)
  #sentence <- gsub('\\d+', "", sentence)
  sentence <- tolower(sentence)
  # get unigrams from the sentence
  unigrams <- unlist(strsplit(sentence, " ", fixed = TRUE))
  # get bigrams from the sentence; note the seq_len() -- the original
  # 1:length(unigrams)-1 parses as (1:length(unigrams)) - 1 and starts at 0
  bigrams <- unlist(lapply(seq_len(length(unigrams) - 1),
                           function(i) paste(unigrams[i], unigrams[i + 1])))
  # .. and combine
  words <- c(unigrams, bigrams)
  #if(sentence_df[i,]$ave_sentiment)
  pos.matches <- na.omit(match(words, pos.words))
  neg.matches <- na.omit(match(words, neg.words))
  positiveList <- pos.words[pos.matches]
  negativeList <- neg.words[neg.matches]
  if (length(positiveList) == 0) positiveList <- "-"
  if (length(negativeList) == 0) negativeList <- "-"
  positiveWordsList[[i]] <- paste(unique(positiveList), collapse = ", ")
  negativeWordsList[[i]] <- paste(unique(negativeList), collapse = ", ")
}
positiveWordsList <- unlist(positiveWordsList)
negativeWordsList <- unlist(negativeWordsList)
scores.df <- data.frame(ave_sentiment = sentence_df$ave_sentiment,
                        comment = sentence_df$comment,
                        pos = positiveWordsList,
                        neg = negativeWordsList,
                        year = sentence_df$year,
                        month = sentence_df$month,
                        stringsAsFactors = FALSE)
I have 28k sentences and 65k words to match against, and the above code takes 45 seconds to accomplish the task. Any suggestions on how to improve its performance?
Edit:
I want to get only those words that exactly match the words in the sentences. For example:
words <- c('sin','vice','crashes')
sentences <- c('Since the app crashes frequently, I advice you guys to fix the issue ASAP')
Now for the above case my output should be as follows:
sentences                                                                    words
Since the app crashes frequently, I advice you guys to fix the issue ASAP    crashes
Upvotes: 3
Views: 235
Reputation: 109874
You can do this in the latest version of sentimentr with extract_sentiment_terms,
but you'll have to make a sentiment key first and assign a value to each word:
pos <- c("far better","good","great","sombre","happy")
neg <- c('sin','vice','crashes')
sentences <- c('Since the app crashes frequently, I advice you guys to fix the issue ASAP',
"This document is far better", "This is a great app","The night skies were sombre and starless",
"The app is too good and i am happy using it", "This is how it works")
library(sentimentr)
(sentkey <- as_key(data.frame(c(pos, neg),
                              c(rep(1, length(pos)), rep(-1, length(neg))),
                              stringsAsFactors = FALSE)))
## x y
## 1: crashes -1
## 2: far better 1
## 3: good 1
## 4: great 1
## 5: happy 1
## 6: sin -1
## 7: sombre 1
## 8: vice -1
extract_sentiment_terms(sentences, sentkey)
## element_id sentence_id negative positive
## 1: 1 1 crashes
## 2: 2 1 far better
## 3: 3 1 great
## 4: 4 1 sombre
## 5: 5 1 good,happy
## 6: 6 1
Upvotes: 0
Reputation: 239
I was able to use @David Arenburg's answer with some modification. Here is what I did. I used the following (suggested by David) to form the data frame.
library(stringi)
df <- data.frame(sentences, stringsAsFactors = FALSE)
df$words <- sapply(sentences, function(x) toString(words[stri_detect_fixed(x, words)]))
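A self-contained sketch of this step, using the example data from the question (the stringi package is assumed to be installed):

```r
library(stringi)

words <- c("far better", "good", "great", "sombre", "happy")
sentences <- c("This document is far better",
               "This is a great app",
               "The app is too good and i am happy using it")

df <- data.frame(sentences, stringsAsFactors = FALSE)
# stri_detect_fixed(x, words) is vectorised over the patterns, so one call
# per sentence checks the whole dictionary at once
df$words <- sapply(sentences,
                   function(x) toString(words[stri_detect_fixed(x, words)]))
```

Note that this is still substring matching, which is why the exact-word filter described next is needed.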
The problem with the above approach is that it does not do exact word matching (e.g. "sin" matches inside "Since"). So I used the following to reshape the data frame to one row per matched word, so the non-exact matches can be filtered out.
# s is the per-sentence list of matched words (e.g. s <- strsplit(df$words, ", ")),
# so each sentence is repeated once per word it matched
df <- data.frame(fil = unlist(s), text = rep(df$sentences, sapply(s, FUN = length)))
After applying the above line the output data frame changes as follows.
sentences                                                                    words
This document is far better                                                  better
This is a great app                                                          great
The night skies were sombre and starless                                     sombre
The app is too good and i am happy using it                                  good
The app is too good and i am happy using it                                  happy
This is how it works                                                         -
Since the app crashes frequently, I advice you guys to fix the issue ASAP    crashes
Since the app crashes frequently, I advice you guys to fix the issue ASAP    vice
Since the app crashes frequently, I advice you guys to fix the issue ASAP    sin
Now apply the following filter to the data frame to remove the rows whose word does not occur as an exact token of the sentence.
df <- df[apply(df, 1, function(x) tolower(x[1]) %in% tolower(unlist(strsplit(x[2], split='\\s+')))),]
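Put together, the false-positive problem and the token filter can be sketched on the question's edit example (base grepl with fixed = TRUE stands in here for stri_detect_fixed):

```r
words <- c("sin", "vice", "crashes")
sentence <- "Since the app crashes frequently, I advice you guys to fix the issue ASAP"

# substring matching flags all three words: "sin" inside "Since",
# "vice" inside "advice", and the genuine hit "crashes"
df <- data.frame(fil  = words[sapply(words, grepl, x = sentence, fixed = TRUE)],
                 text = sentence,
                 stringsAsFactors = FALSE)

# keep a row only when the word is a whole whitespace-delimited token
df <- df[apply(df, 1, function(x)
  tolower(x[1]) %in% tolower(unlist(strsplit(x[2], split = "\\s+")))), ]
```

One caveat: a whitespace-token filter also drops legitimate multi-word entries such as "far better", and punctuation glued to a token (e.g. "good,") would block a match, so stripping punctuation first may be needed.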
Now my resulting data frame will be as follows.
sentences                                                                    words
This document is far better                                                  better
This is a great app                                                          great
The night skies were sombre and starless                                     sombre
The app is too good and i am happy using it                                  good
The app is too good and i am happy using it                                  happy
This is how it works                                                         -
Since the app crashes frequently, I advice you guys to fix the issue ASAP    crashes
stri_detect_fixed reduced my computation time a lot. The remaining processing did not take much time. Thanks to @David for pointing me in the right direction.
Upvotes: 1