Reputation: 1139
I'm looking for some simple vectorized approach for my for loop in R. I have the following data frame with sentences and two dictionaries of positive and negative words:
# Create data.frame with sentences
sent <- data.frame(words = c("just right size and i love this notebook", "benefits great laptop",
"wouldnt bad notebook", "very good quality", "orgtop",
"great improvement for that bad product but overall is not good", "notebook is not good but i love batterytop"), user = c(1,2,3,4,5,6,7),
stringsAsFactors=F)
# Create pos/negWords
posWords <- c("great","improvement","love","great improvement","very good","good","right","very","benefits",
"extra","benefit","top","extraordinarily","extraordinary","super","benefits super","good","benefits great",
"wouldnt bad")
negWords <- c("hate","bad","not good","horrible")
And now I create replicates of the original data frame to simulate a big data set:
# Replicate original data.frame - big data simulation (700,000 rows of sentences)
df.expanded <- as.data.frame(replicate(100000, sent$words))
library(zoo)   # coredata() comes from zoo
sent <- coredata(sent)[rep(seq(nrow(sent)), 100000), ]
rownames(sent) <- NULL
For the next step, I have to order the dictionary words by descending length and attach their sentiment score (positive word = 1, negative word = -1).
# Ordering words in pos/negWords
wordsDF <- data.frame(words = posWords, value = 1,stringsAsFactors=F)
wordsDF <- rbind(wordsDF,data.frame(words = negWords, value = -1))
wordsDF$lengths <- unlist(lapply(wordsDF$words, nchar))
wordsDF <- wordsDF[order(-wordsDF[,3]),]
rownames(wordsDF) <- NULL
Then I define the following function with a for loop:
# Sentiment score function
library(qdapRegex)   # provides rm_white()
scoreSentence2 <- function(sentence){
  score <- 0
  for(x in 1:nrow(wordsDF)){
    matchWords <- paste("\\<", wordsDF[x,1], '\\>', sep="")   # matching exact words
    count <- length(grep(matchWords, sentence))               # count them
    if(count){
      score <- score + (count * wordsDF[x,2])                 # compute score (count * sentValue)
      sentence <- gsub(paste0('\\s*\\b', wordsDF[x,1], '\\b\\s*', collapse='|'), ' ', sentence)  # remove matched words
      sentence <- rm_white(sentence)                          # collapse leftover whitespace
    }
  }
  score
}
And I call the previous function on sentences in my data frame:
# Apply scoreSentence function to sentences
SentimentScore2 <- unlist(lapply(sent$words, scoreSentence2))
# Time consumption for 700.000 sentences in sent data.frame:
# user system elapsed
# 1054.19 0.09 1056.17
# Add sentiment score to origin sent data.frame
sent <- cbind(sent, SentimentScore2)
Desired output is:
Words user SentimentScore2
just right size and i love this notebook 1 2
benefits great laptop 2 1
wouldnt bad notebook 3 1
very good quality 4 1
orgtop 5 0
.
.
.
And so forth...
Please, can anyone help me reduce the computing time of my original approach? Because of my beginner-level programming skills in R, I'm at a dead end :-) Any help or advice will be very much appreciated. Thank you very much in advance.
Upvotes: 4
Views: 246
Reputation: 4534
In the spirit of "teaching somebody to fish is better than giving them a fish", I'll walk you through it:
Make a copy of your code: you are going to mess it up!
Find the bottleneck:
1a: make the problem smaller:
nRep <- 100
df.expanded <- as.data.frame(replicate(nRep, sent$words))
library(zoo)
sent <- coredata(sent)[rep(seq(nrow(sent)), nRep), ]
1b: keep a reference solution: you'll be changing your code, and there are few activities as good at introducing bugs as optimizing code!
sentRef <- sent
and add the same line, commented out, at the end of your code so you remember what your reference is. To make it even easier to check that you are not breaking your code, you can test it automatically at the end of your script:
library("testthat")
expect_equal(sent,sentRef)
1c: Trigger the profiler around the code to look at:
Rprof()
SentimentScore2 <- unlist(lapply(sent$words, scoreSentence2))
Rprof(NULL)
1d: view the result, with base R:
summaryRprof()
There are also nicer tools; you can check the packages profileR or lineprof. lineprof is my tool of choice and adds real value here, allowing you to narrow the problem down to these two lines:
matchWords <- paste("\\<",wordsDF[x,1],'\\>', sep="") # matching exact words
count <- length(grep(matchWords,sentence)) # count them
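If you prefer to stay in base R, you can get line-level output from Rprof itself. A minimal sketch (it needs R >= 3.0.0, and scoreSentence2 must come from a source()d file so that line references exist; "scoreSentence2.R" is a hypothetical file name):
source("scoreSentence2.R", keep.source = TRUE)   # hypothetical file containing the function
Rprof(line.profiling = TRUE)                     # record per-line timings
SentimentScore2 <- unlist(lapply(sent$words, scoreSentence2))
Rprof(NULL)
summaryRprof(lines = "show")                     # report by line instead of by function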
Fix it.
3.1 Fortunately the main problem is fairly easy: the first line does not need to be inside the function, so move it out. By the way, the same applies to your paste0(). Your code becomes:
matchWords <- paste("\\<", wordsDF[,1], '\\>', sep="")        # matching exact words
matchedWords <- paste0('\\s*\\b', wordsDF[,1], '\\b\\s*')
# Sentiment score function
scoreSentence2 <- function(sentence){
  score <- 0
  for(x in 1:nrow(wordsDF)){
    count <- length(grep(matchWords[x], sentence))            # count them
    if(count){
      score <- score + (count * wordsDF[x,2])                 # compute score (count * sentValue)
      sentence <- gsub(matchedWords[x], ' ', sentence)        # remove matched words
      # sentence <- rm_white(sentence)                        # (disabled here, so qdapRegex is not needed)
    }
  }
  score
}
That changes the execution time for 1000 reps from 5.64s to 2.32s. Not a bad investment!
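(For reference, a timing like this can be reproduced by wrapping the scoring call in system.time; a sketch, assuming sent has been rebuilt with nRep <- 1000 as in step 1a:)
system.time(SentimentScore2 <- unlist(lapply(sent$words, scoreSentence2)))   # look at the "elapsed" column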
3.2 The next bottleneck is the "count <-" line, but I think shadow already had the right answer :-) Combined we get:
matchWords <- paste("\\<", wordsDF[,1], '\\>', sep="")        # matching exact words
matchedWords <- paste0('\\s*\\b', wordsDF[,1], '\\b\\s*')
# Sentiment score function
scoreSentence2 <- function(sentence){
  score <- 0
  for(x in 1:nrow(wordsDF)){
    count <- grepl(matchWords[x], sentence)                   # TRUE/FALSE: does the word occur?
    score <- score + (count * wordsDF[x,2])                   # compute score (count * sentValue)
    sentence <- gsub(matchedWords[x], ' ', sentence)          # remove matched words
    # sentence <- rm_white(sentence)                          # (disabled here, so qdapRegex is not needed)
  }
  score
}
Here that makes 0.18s or 31 times faster...
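A quick sanity check that the optimized version still returns the same scores (a sketch; scoreSentence2Orig is assumed to be a copy of your original, unoptimized function kept around for comparison):
refScores <- unlist(lapply(sent$words, scoreSentence2Orig))   # original slow version
newScores <- unlist(lapply(sent$words, scoreSentence2))       # optimized version from above
library(testthat)
expect_equal(newScores, refScores)                            # fails loudly if the optimization changed results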
Upvotes: 5
Reputation: 22293
You can easily vectorize your scoreSentence2 function, since grep and grepl are already vectorized:
library(qdapRegex)   # provides rm_white()
scoreSentence <- function(sentence){
  score <- rep(0, length(sentence))
  for(x in 1:nrow(wordsDF)){
    matchWords <- paste("\\<", wordsDF[x,1], '\\>', sep="")   # matching exact words
    count <- grepl(matchWords, sentence)                      # does the word occur in each sentence?
    score <- score + (count * wordsDF[x,2])                   # compute score (count * sentValue)
    sentence <- gsub(paste0('\\s*\\b', wordsDF[x,1], '\\b\\s*', collapse='|'), ' ', sentence)  # remove matched words
    sentence <- rm_white(sentence)
  }
  return(score)
}
scoreSentence(sent$words)
Note that count does not actually count the number of times the expression appears in one sentence (neither in your version nor in mine); it just tells you whether the expression appears at all. If you want to actually count occurrences, you could use the following instead:
count <- sapply(gregexpr(matchWords, sentence), function(x) length(x[x>0]))
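For example, on a made-up sentence with a repeated word (hypothetical values, just to illustrate the difference):
s <- "good product but overall good"
p <- "\\<good\\>"
grepl(p, s)                                              # TRUE -> contributes 1, no matter how often the word repeats
sapply(gregexpr(p, s), function(x) length(x[x > 0]))     # 2    -> actual number of occurrences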
Upvotes: 1