martinkabe

Reputation: 1139

Vectorizing for loops to speed up a program in R

I'm looking for a simple vectorized approach to replace the for loop in my R code. I have the following data frame with sentences and two dictionaries of positive and negative words:

# Create data.frame with sentences
sent <- data.frame(words = c("just right size and i love this notebook",
                             "benefits great laptop",
                             "wouldnt bad notebook",
                             "very good quality",
                             "orgtop",
                             "great improvement for that bad product but overall is not good",
                             "notebook is not good but i love batterytop"),
                   user = 1:7,
                   stringsAsFactors = FALSE)

# Create pos/negWords
posWords <- c("great","improvement","love","great improvement","very good","good","right","very","benefits",
              "extra","benefit","top","extraordinarily","extraordinary","super","benefits super","good","benefits great",
              "wouldnt bad")
negWords <- c("hate","bad","not good","horrible")

And now I create replicates of the original data frame to simulate a big data set:

# Replicate the original data.frame - big data simulation (700,000 rows of sentences)
library(zoo) # coredata() comes from zoo
df.expanded <- as.data.frame(replicate(100000, sent$words))
sent <- coredata(sent)[rep(seq(nrow(sent)), 100000), ]
rownames(sent) <- NULL

For my next step, I order the words in the dictionaries by descending length and attach a sentiment score to each (positive word = 1, negative word = -1), so that longer phrases are matched before the single words they contain:

# Ordering words in pos/negWords
wordsDF <- data.frame(words = posWords, value = 1,stringsAsFactors=F)
wordsDF <- rbind(wordsDF,data.frame(words = negWords, value = -1))
wordsDF$lengths <- nchar(wordsDF$words)
wordsDF <- wordsDF[order(-wordsDF$lengths), ]
rownames(wordsDF) <- NULL

Then I define the following function with a for loop:

# Sentiment score function
library(qdapRegex) # rm_white() comes from qdapRegex
scoreSentence2 <- function(sentence){
  score <- 0
  for(x in 1:nrow(wordsDF)){
    matchWords <- paste("\\<",wordsDF[x,1],'\\>', sep="") # matching exact words
    count <- length(grep(matchWords,sentence)) # count them
    if(count){
      score <- score + (count * wordsDF[x,2]) # compute score (count * sentValue)
      sentence <- gsub(paste0('\\s*\\b', wordsDF[x,1], '\\b\\s*', collapse='|'), ' ', sentence) # remove matched words from wordsDF
      sentence <- rm_white(sentence) # strip the extra whitespace left by gsub()
    }
  }
  score
}

And I call the previous function on sentences in my data frame:

# Apply scoreSentence function to sentences
SentimentScore2 <- unlist(lapply(sent$words, scoreSentence2))
# Time consumption for 700,000 sentences in the sent data.frame:
# user       system    elapsed
# 1054.19    0.09      1056.17
# Add sentiment score to origin sent data.frame
sent <- cbind(sent, SentimentScore2)

Desired output is:

words                                             user      SentimentScore2
just right size and i love this notebook          1         2
benefits great laptop                             2         1
wouldnt bad notebook                              3         1
very good quality                                 4         1
orgtop                                            5         0
  .
  .
  .

And so forth. Note that "wouldnt bad notebook" scores +1 because the longer positive phrase "wouldnt bad" is matched and removed before the negative word "bad" can match; this is exactly why the dictionary is ordered by descending length.

Please, can anyone help me reduce the computing time of my original approach? With my beginner programming skills in R I've hit a dead end :-) Any help or advice will be very much appreciated. Thank you very much in advance.

Upvotes: 4

Views: 246

Answers (2)

cmbarbu

Reputation: 4534

In the spirit of "teaching someone to fish is better than giving them a fish", I'll walk you through it:

  1. Make a copy of your code: you are going to mess it up!

  2. Find the bottleneck:

    2a: make the problem smaller:

    nRep <- 100
    df.expanded <- as.data.frame(replicate(nRep, sent$words))
    library(zoo)
    sent <- coredata(sent)[rep(seq(nrow(sent)), nRep), ]
    

    2b: keep a reference solution: you'll be changing your code, and few activities are as good at introducing bugs as optimizing code!

    sentRef <- sent
    

    and add the same lines, commented out, at the end of your code to remember where your reference is. To make it even easier to check that you are not breaking anything, you can test it automatically at the end of your code:

    library("testthat")
    expect_equal(sent,sentRef)
    

    2c: Trigger the profiler around the code you want to inspect:

    Rprof()
    SentimentScore2 <- unlist(lapply(sent$words, scoreSentence2))
    Rprof(NULL)
    

    2d: view the result, with base R:

    summaryRprof()
    

    There are also nicer tools; you can check the profileR or lineprof packages.

    lineprof is my tool of choice and adds real value here, letting us narrow the problem down to these two lines:

    matchWords <- paste("\\<",wordsDF[x,1],'\\>', sep="") # matching exact words
    count <- length(grep(matchWords,sentence)) # count them
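
    For reference, a minimal sketch of such a lineprof session (lineprof is installed from GitHub; "score.R" is a placeholder for a file holding your own functions, since lineprof needs source references):

    # devtools::install_github("hadley/lineprof")
    library(lineprof)
    source("score.R", keep.source = TRUE) # defines scoreSentence2()
    l <- lineprof(unlist(lapply(sent$words, scoreSentence2)))
    shine(l) # interactive, line-by-line timings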
    
  3. Fix it.

    3.1 Fortunately the main problem is fairly easy: the first line doesn't need to be inside the function; move it out so the patterns are computed only once. The same applies to your paste0() call. Your code becomes:

    matchWords <- paste("\\<",wordsDF[,1],'\\>', sep="") # matching exact words
    matchedWords <- paste0('\\s*\\b', wordsDF[,1], '\\b\\s*')
    
    # Sentiment score function
    scoreSentence2 <- function(sentence){
        score <- 0
        for(x in 1:nrow(wordsDF)){
            count <- length(grep(matchWords[x],sentence)) # count them
            if(count){
                score <- score + (count * wordsDF[x,2]) # compute score (count * sentValue)
                sentence <- gsub(matchedWords[x],' ', sentence) # remove matched words from wordsDF
                # sentence <- rm_white(sentence) # dropped: the leftover
                # whitespace does not affect the word-boundary matches
            }
        }
        score
    }
    

    That changes the execution time for 1000 reps from 5.64s to 2.32s. Not a bad investment!

    3.2 The next bottleneck is the "count <-" line, but I think shadow has exactly the right answer :-) Combined, we get:

    matchWords <- paste("\\<",wordsDF[,1],'\\>', sep="") # matching exact words
    matchedWords <- paste0('\\s*\\b', wordsDF[,1], '\\b\\s*')
    
    # Sentiment score function
    scoreSentence2 <- function(sentence){
        score <- 0
        for(x in 1:nrow(wordsDF)){
            count <- grepl(matchWords[x],sentence) # TRUE/FALSE per sentence
            score <- score + (count * wordsDF[x,2]) # compute score (count * sentValue)
            sentence <- gsub(matchedWords[x],' ', sentence) # remove matched words from wordsDF
            # sentence <- rm_white(sentence) # dropped: the leftover
            # whitespace does not affect the word-boundary matches
        }
        score
    }
    

That brings it down to 0.18s, or 31 times faster...
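
If you want to reproduce these measurements yourself, wrapping the scoring call in base R's system.time() is enough (a sketch, assuming the setup from step 2a with nRep <- 1000):

    system.time(
        SentimentScore2 <- unlist(lapply(sent$words, scoreSentence2))
    )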

Upvotes: 5

shadow

Reputation: 22293

You can easily vectorize your scoreSentence2 function, since grepl and gsub are already vectorized over their input strings:

library(qdapRegex) # rm_white() comes from qdapRegex

scoreSentence <- function(sentence){
  score <- rep(0, length(sentence))
  for(x in 1:nrow(wordsDF)){
    matchWords <- paste("\\<",wordsDF[x,1],'\\>', sep="") # matching exact words
    count <- grepl(matchWords, sentence) # count them
    score <- score + (count * wordsDF[x,2]) # compute score (count * sentValue)
    sentence <- gsub(paste0('\\s*\\b', wordsDF[x,1], '\\b\\s*', collapse='|'), ' ', sentence) # remove matched words from wordsDF
    sentence <- rm_white(sentence)
  }
  return(score)
}
scoreSentence(sent$words)
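
To see the vectorization at work, note that a single grepl() call already returns one logical per input sentence (a small example built from the question's data):

    grepl("\\<good\\>", c("very good quality", "orgtop"))
    # [1]  TRUE FALSE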

Note that the count does not actually count the number of times the expression appears in one sentence (neither in your version nor in mine); it only tells you whether the expression appears at all. If you want to actually count the occurrences, you could use the following instead:

count <- sapply(gregexpr(matchWords, sentence), function(x) length(x[x>0]))
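
Slotted into the vectorized function above, that gives a counting variant (a sketch of the suggested change, not code tested by either author):

    scoreSentenceCount <- function(sentence){
      score <- rep(0, length(sentence))
      for(x in 1:nrow(wordsDF)){
        matchWords <- paste("\\<", wordsDF[x,1], '\\>', sep="")
        # count every occurrence per sentence, not just presence/absence
        count <- sapply(gregexpr(matchWords, sentence), function(m) length(m[m > 0]))
        score <- score + (count * wordsDF[x,2])
        sentence <- gsub(paste0('\\s*\\b', wordsDF[x,1], '\\b\\s*'), ' ', sentence)
        sentence <- rm_white(sentence)
      }
      score
    }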

Upvotes: 1
