Reputation: 301
I’m working with Twitter data and I’m currently trying to find frequencies of bigrams in which the first word is “the”. I’ve written a function which seems to be doing what I want but is extremely slow (originally I wanted to see frequencies of all bigrams but I gave up because of the speed). Is there a faster way of solving this problem? I’ve heard about the RWeka package, but have trouble installing it, I get an error about (ERROR: dependencies ‘RWekajars’, ‘rJava’ are not available for package ‘RWeka’)…
required libraries: tau and tcltk
bigramThe <- function(dataset,column) {
bidata <- data.frame(x= character(0), y= numeric(0))
pb <- tkProgressBar(title = "progress bar", min = 0,max = nrow(dataset), width = 300)
for (i in 1:nrow(dataset)) {
a <- column[i]
bi<-textcnt(a, n = 2, method = "string")
tweetbi <- data.frame(V1 = as.vector(names(bi)), V2 = as.numeric(bi))
tweetbi$grepl<-grepl("the ",tweetbi$V1)
tweetbi<-tweetbi[which(tweetbi$grepl==TRUE),]
bidata <- rbind(bidata, tweetbi)
setTkProgressBar(pb, i, label=paste( round(i/nrow(dataset), 0), "% done"))}
aggbi<-aggregate(bidata$V2, by=list(bidata $V1), FUN=sum)
close(pb)
return(aggbi)
}
I have almost 500,000 rows of tweets stored in a column that I pass to the function. An example dataset would look like this:
text userid
tweet text 1 1
tweets text 2 2
the tweet text 3 3
Upvotes: 1
Views: 530
Reputation: 5494
You can get rid of the evil loop structure by collapsing the text column into one long string:
paste(dataset[[column]], collapse=" *** ")
bi<-textcnt(a, n = 2, method = "string")
I expected to also need subset(bi, function(x) !grepl("*", x)
But it turns out that the textcnt method doesn't include bigrams with * in them, so you're good to go.
Upvotes: 1
Reputation: 23200
To use RWeka
, first run sudo apt-get install openjdk-6-jdk
(or install/re-install your JDK in Windows GUI) then try re-installing the package.
Should that fail, use download.file
to download the source .zip
file and install from source, i.e. install.packages("RWeka.zip", type = "source", repos = NULL)
.
If you want to speed things up without using a different package, consider using multicore
and re-writing the code to use an apply
function which can take advantage of parallelism.
Upvotes: 1