Reputation: 23
Generally, I am interested in getting a process to run faster.
I am using R to do sentiment analysis on a German corpus of about 8000 documents. Instead of just counting positive and negative words, I have a value between -1 and 1 assigned to about 3000 different terms. As I am not using the stem function and still want to match all the inflected forms of German grammar, my word lists become even longer.
For matching I am using this code at the moment:
score.sum <- rep(0, length(texts))
for (i in 1:length(texts)) {
  for (j in 1:length(sent.words)) {
    if (sent.words[j] %in% strsplit(texts[i], split = " ")[[1]]) {
      score.sum[i] <- score.sum[i] + sent.words_score[j]
    }
  }
}
As a mini-example one could use:
texts <- c("I like ice cream. It is great","I hate flying because it makes me sick","If I get bored I do something fun")
sent.words <- c("like","great","hate","sick","bored","fun","joy")
sent.words_score <- c(0.3,0.7,-0.5,-0.4,-0.4,0.3,0.5)
Maybe the calculations are also taking longer than you would want. In my context, with the 8000 documents, it takes about 6 hours. So, do you know of a way to avoid the double for loop and speed up the computation?
Thanks in advance, Mairuu
Upvotes: 2
Views: 450
Reputation: 601
I'm coding a sentiment analyser in C++ and I use a trie data structure to store all the words. Lookups are very fast: a successful lookup for a word is O(n), with n being the length of the string, while a failed lookup is obviously cheaper than that. Just something to consider to improve performance.
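For what it's worth, here is a minimal sketch of the same trie idea in R rather than C++ (the nested-list node layout, the .score key and the helper names are just illustrative, not the analyser mentioned above): each node is a list keyed by single characters, and a node that ends a word carries that word's score.
# word lists taken from the question's mini-example
sent.words <- c("like","great","hate","sick","bored","fun","joy")
sent.words_score <- c(0.3,0.7,-0.5,-0.4,-0.4,0.3,0.5)
trie_insert <- function(node, chars, score) {
  # chars is the word split into single characters
  if (length(chars) == 0) {              # end of the word: store its score here
    node$.score <- score
    return(node)
  }
  ch <- chars[1]
  child <- if (is.null(node[[ch]])) list() else node[[ch]]
  node[[ch]] <- trie_insert(child, chars[-1], score)
  node
}
trie_lookup <- function(node, word) {
  for (ch in strsplit(word, "")[[1]]) {
    node <- node[[ch]]
    if (is.null(node)) return(NA_real_)  # failed lookup: stop as soon as a prefix is missing
  }
  if (is.null(node$.score)) NA_real_ else node$.score
}
# build the trie from the word list, then query it
trie <- list()
for (k in seq_along(sent.words)) {
  trie <- trie_insert(trie, strsplit(sent.words[k], "")[[1]], sent.words_score[k])
}
trie_lookup(trie, "great")    # 0.7
trie_lookup(trie, "greatly")  # NA, not in the word list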
Upvotes: 0
Reputation: 121578
strsplit is vectorized, so you can do the splitting once. There is also no need to use for here; use sapply to avoid the initialization and the side effect.
sapply(strsplit(texts, split = " "),
       function(x) sum(sent.words_score[sent.words %in% x]))
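On the mini-example from the question this should give the same scores as the double loop, assuming the same space-only tokenisation:
# data from the question
texts <- c("I like ice cream. It is great",
           "I hate flying because it makes me sick",
           "If I get bored I do something fun")
sent.words <- c("like","great","hate","sick","bored","fun","joy")
sent.words_score <- c(0.3,0.7,-0.5,-0.4,-0.4,0.3,0.5)
# one vectorized pass: split every text once, then sum the scores of the matched words
score.sum <- sapply(strsplit(texts, split = " "),
                    function(x) sum(sent.words_score[sent.words %in% x]))
score.sum
# roughly: 1.0 -0.9 -0.1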
Upvotes: 2