Reputation: 301

Slow bigram frequency function in R

I’m working with Twitter data and am trying to find the frequencies of bigrams whose first word is “the”. I’ve written a function that seems to do what I want, but it is extremely slow (originally I wanted the frequencies of all bigrams, but I gave up because of the speed). Is there a faster way to solve this problem? I’ve heard about the RWeka package, but I have trouble installing it. I get this error: ERROR: dependencies ‘RWekajars’, ‘rJava’ are not available for package ‘RWeka’

Required libraries: tau and tcltk

library(tau)    # textcnt()
library(tcltk)  # tkProgressBar(), setTkProgressBar()

bigramThe <- function(dataset, column) {
    bidata <- data.frame(x = character(0), y = numeric(0))
    pb <- tkProgressBar(title = "progress bar", min = 0, max = nrow(dataset), width = 300)
    for (i in 1:nrow(dataset)) {
        a <- column[i]
        # count the bigrams (two-word sequences) in this tweet
        bi <- textcnt(a, n = 2, method = "string")
        tweetbi <- data.frame(V1 = as.vector(names(bi)), V2 = as.numeric(bi))
        # flag bigrams whose first word is "the" ("^" stops "bathe in" from matching)
        tweetbi$grepl <- grepl("^the ", tweetbi$V1)
        tweetbi <- tweetbi[which(tweetbi$grepl == TRUE), ]
        bidata <- rbind(bidata, tweetbi)
        # multiply by 100 so the label shows a percentage rather than 0 or 1
        setTkProgressBar(pb, i, label = paste(round(100 * i / nrow(dataset)), "% done"))
    }
    aggbi <- aggregate(bidata$V2, by = list(bidata$V1), FUN = sum)
    close(pb)
    return(aggbi)
}

I have almost 500,000 rows of tweets stored in a column that I pass to the function. An example dataset would look like this:

text                userid

tweet text 1           1
tweets text 2          2
the tweet text 3       3

Upvotes: 1

Answers (2)

Guy Adini

Reputation: 5494

You can get rid of the evil loop structure by collapsing the text column into one long string:

# column: the name of the text column, e.g. "text"
a <- paste(dataset[[column]], collapse = " *** ")
bi <- textcnt(a, n = 2, method = "string")

I expected to also need a filter such as bi[!grepl("*", names(bi), fixed = TRUE)] to drop bigrams that contain the separator.

But it turns out that textcnt doesn't include bigrams with * in them, so you're good to go.
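If you are worried that a pair of words could straddle two tweets once everything is flattened, here is a minimal base R sketch of the same "no explicit loop" idea that keeps track of which tweet each word came from. It is only an illustration: count_the_bigrams and tweets are names I made up, and the tokeniser is a crude split on anything that is not an ASCII letter or apostrophe, so it will not match textcnt's tokenisation exactly.

```r
# Sketch: count bigrams whose first word is "the" without a per-tweet loop.
# `tweets` is a hypothetical character vector standing in for the text column.
count_the_bigrams <- function(tweets) {
  # Split every tweet into lower-case words (ASCII letters and apostrophes only).
  words <- strsplit(tolower(tweets), "[^a-z']+")

  # Flatten to one long word vector, remembering the tweet each word came from.
  tweet_id <- rep(seq_along(words), lengths(words))
  w <- unlist(words)
  keep <- nzchar(w)            # strsplit can emit empty strings at the start
  w <- w[keep]
  tweet_id <- tweet_id[keep]
  if (length(w) < 2) return(table(character(0)))

  # Pair each word with its successor in the flattened vector.
  first  <- head(w, -1)
  second <- tail(w, -1)
  # A pair only counts as a bigram if both words belong to the same tweet.
  same_tweet <- head(tweet_id, -1) == tail(tweet_id, -1)

  hits <- paste(first, second)[same_tweet & first == "the"]
  sort(table(hits), decreasing = TRUE)
}

tweets <- c("the tweet text", "The tweet is the best", "bathe the cat", "the")
res <- count_the_bigrams(tweets)
print(res)
```

With a data frame like the one in the question, the call would be count_the_bigrams(dataset$text). Everything here is vectorised, so there is no data frame being grown row by row.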

Upvotes: 1

Hack-R

Reputation: 23200

To use RWeka, first run sudo apt-get install openjdk-6-jdk (or install or reinstall your JDK through its installer on Windows), then try reinstalling the package.

Should that fail, use download.file to download the source .zip file and install from source, i.e. install.packages("RWeka.zip", type = "source", repos = NULL).

If you want to speed things up without switching to a different package, consider using multicore and rewriting the loop as an apply-style call that can take advantage of parallelism.
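As a rough sketch of the parallel route: the multicore functions now ship with R in the parallel package, so mclapply can be used without installing anything. The helper names (the_bigrams_in, count_the_bigrams_parallel), the chunk count and the base R tokeniser are my own assumptions, not part of the question's code. Forking is not available on Windows, so the sketch falls back to a single core there.

```r
library(parallel)  # ships with R; provides mclapply()

# Hypothetical helper: all bigrams starting with "the" in a chunk of tweets.
the_bigrams_in <- function(tweets) {
  # Crude tokeniser: lower-case, split on anything but ASCII letters/apostrophes.
  words <- strsplit(tolower(tweets), "[^a-z']+")
  unlist(lapply(words, function(w) {
    w <- w[nzchar(w)]
    if (length(w) < 2) return(character(0))
    b <- paste(head(w, -1), tail(w, -1))   # adjacent word pairs within one tweet
    b[startsWith(b, "the ")]
  }))
}

count_the_bigrams_parallel <- function(tweets, n_chunks = 4L) {
  # mclapply forks the R process, which Windows does not support.
  cores <- if (.Platform$OS.type == "windows") 1L else 2L
  # Deal the tweets into chunks; order does not matter for counting.
  chunks <- split(tweets, rep_len(seq_len(n_chunks), length(tweets)))
  found <- mclapply(chunks, the_bigrams_in, mc.cores = cores)
  # Combine the per-chunk results and count once at the end.
  sort(table(unlist(found, use.names = FALSE)), decreasing = TRUE)
}

tweets <- c("the tweet text", "The tweet is the best", "bathe the cat", "the")
res_par <- count_the_bigrams_parallel(tweets)
print(res_par)
```

The point of the design is that each worker returns a plain character vector and the counting happens once at the end, which avoids the repeated rbind that makes the original loop slow. Whether the extra cores pay off on 500,000 tweets is something you would need to time yourself.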

Upvotes: 1
