Mairuu

Reputation: 23

Using scores in sentiment analysis with R

In general, I am interested in making a process run faster.

I am using R to do sentiment analysis on a German corpus of about 8000 documents. Instead of just counting positive and negative words, I have a value between -1 and 1 assigned to about 3000 different terms. As I am not using a stemming function and still want to catch all the inflected forms of German grammar, my word lists become even longer.

For matching I am using this code at the moment:

score.sum <- rep(0, length(texts))
for (i in 1:length(texts)) {
  for (j in 1:length(sent.words)) {
    if (sent.words[j] %in% strsplit(texts[i], split = " ")[[1]]) {
      score.sum[i] <- score.sum[i] + sent.words_score[j]
    }
  }
}

As a mini-example one could use:

texts <- c("I like ice cream. It is great","I hate flying because it makes me sick","If I get bored I do something fun")

sent.words <- c("like","great","hate","sick","bored","fun","joy")
sent.words_score <- c(0.3,0.7,-0.5,-0.4,-0.4,0.3,0.5)

The calculations may well be taking longer than you would want them to. In my context, with the 8000 documents, it takes about 6 hours. So do you know of a way to avoid the double for-loop and make the computation faster?

Thanks in advance, Mairuu

Upvotes: 2

Views: 450

Answers (2)

Umashankar Das

Reputation: 601

I'm coding a sentiment analyser in C++, and I use a trie data structure to store all the words. The response is very fast: a successful lookup per word is O(n), with n being the length of the string, while a failure is obviously cheaper than that. Just something to consider to improve performance.
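The answer describes a C++ trie, but the same idea can be sketched in R. Below is a minimal, hypothetical illustration using environments as nodes (the names `trie_new`, `trie_insert`, and `trie_contains` are mine, not from the answer):

```r
# Minimal trie sketch: each node is an environment whose names are
# single characters; ".end" marks that a complete word ends at a node.
trie_new <- function() new.env(parent = emptyenv())

trie_insert <- function(trie, word) {
  node <- trie
  for (ch in strsplit(word, "")[[1]]) {
    if (!exists(ch, envir = node, inherits = FALSE)) {
      assign(ch, new.env(parent = emptyenv()), envir = node)
    }
    node <- get(ch, envir = node, inherits = FALSE)
  }
  assign(".end", TRUE, envir = node)  # mark a complete word
  invisible(trie)
}

trie_contains <- function(trie, word) {
  node <- trie
  for (ch in strsplit(word, "")[[1]]) {
    if (!exists(ch, envir = node, inherits = FALSE)) return(FALSE)
    node <- get(ch, envir = node, inherits = FALSE)
  }
  exists(".end", envir = node, inherits = FALSE)
}

tr <- trie_new()
for (w in c("great", "hate", "sick")) trie_insert(tr, w)
trie_contains(tr, "great")  # TRUE
trie_contains(tr, "gre")    # FALSE (a prefix, not a stored word)
```

Each lookup walks one node per character, which is the O(n)-per-word behaviour the answer refers to.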

Upvotes: 0

agstudy

Reputation: 121578

strsplit is vectorized, so you can do it once.

Also, there is no need to use a for loop here; use sapply to avoid the initialization and the side effect.

sapply(strsplit(texts, split = " "),
       function(x) sum(sent.words_score[sent.words %in% x]))
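Applied to the mini-example from the question, this returns one score per document (the sum of the matched term scores):

```r
texts <- c("I like ice cream. It is great",
           "I hate flying because it makes me sick",
           "If I get bored I do something fun")
sent.words <- c("like", "great", "hate", "sick", "bored", "fun", "joy")
sent.words_score <- c(0.3, 0.7, -0.5, -0.4, -0.4, 0.3, 0.5)

scores <- sapply(strsplit(texts, split = " "),
                 function(x) sum(sent.words_score[sent.words %in% x]))
scores  # 1.0 -0.9 -0.1
```

Splitting all the texts once and letting `%in%` do the membership test replaces the inner for-loop entirely, which is where the speed-up comes from.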

Upvotes: 2
