Reputation: 598
I am looping over text to find and count specific words from various dictionaries. I use two FOR loops, which are extremely slow and take days to complete. Reproducible code below:
library(stringr)
#Sample data
tweets=data.frame(id=c(1,2,3),text=c("This is a tweet that contains word1",
"And here you can find word1 and word2 word2",
"And here is only one word3 and one word3a"))
words=data.frame(id=c(1,2,3),word=c("word1","word2","word3"))
for(i in 1:nrow(tweets)){
  for(j in 1:nrow(words)){
    term = paste("\\<",words[j,2],"\\>", sep="")
    if (str_count(tweets[i,2], term) != 0) {
      tmp <- data.frame(id=tweets[i,1],termfound=words[j,2],count=str_count(tweets[i,2], term), row.names=NULL)
      message("ID ",tweets[i,1]," - Word '",words[j,2],"' found ",str_count(tweets[i,2], term)," times")
      #sqlSave(myconn, tmp, "DataTable", append=T, rownames=F)
    }
  }
}
NOTES:
I have ~1M lines of text and ~25,000 words I am counting.
The Message line is just for debugging.
The final values are written to SQL - line commented out as it is not reproducible.
Any way to improve on this? I was thinking an APPLY function???
Cheers B
Upvotes: 1
Views: 224
Reputation: 99341
Update: You could also try stringi within a data.table.
library(data.table); library(stringi)
## convert tweets to data table and set key on 'id' column
dtweets <- as.data.table(tweets)
setkey(dtweets, id)
## convert words to data.table and set up the regex
dtw <- as.data.table(words)
dtw[,term := stri_c("\\b", word, "\\b")]
## run stri_count_regex by each id
dtn <- dtweets[, stri_count_regex(text, dtw$term), by = id]
# id V1
# 1: 1 1
# 2: 1 0
# 3: 1 0
# 4: 2 1
# 5: 2 2
# 6: 2 0
# 7: 3 0
# 8: 3 0
# 9: 3 1
## reshape to wide: one column per distinct count value, summed within each id
melted <- melt(dtn, id = 1L, measure = 2L)
dcast(melted, id ~ value, sum)
# id 0 1 2
# 1 1 0 1 0
# 2 2 0 1 2
# 3 3 0 1 0
Original answer
Here's another method that counts the matches of every term in each tweet with one vectorized call per tweet, then binds the counts to the tweet ids. I had to use \\b for the word boundary in term.
library(stringi)
term <- stri_c("\\b", words$word, "\\b")
## one column of counts per tweet; FUN.VALUE needs one slot per term
out <- vapply(seq_along(tweets$text), function(i) {
  stri_count_regex(tweets$text[i], term)
}, integer(length(term)))
## transpose so that rows are tweets and columns are words
cbind(tweets[1], `colnames<-`(t(out), words$word))
#   id word1 word2 word3
# 1  1     1     0     0
# 2  2     1     2     0
# 3  3     0     0     1
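The word-boundary anchors matter because word3 should not be counted inside word3a. A minimal base-R sketch of the difference, using gregexpr as a stand-in for the stringi call; count_matches is a hypothetical helper name, not part of any package:

```r
text <- "And here is only one word3 and one word3a"

# Hypothetical helper: number of matches of one pattern in one string.
count_matches <- function(pattern, x) {
  lengths(regmatches(x, gregexpr(pattern, x, perl = TRUE)))
}

without_boundary <- count_matches("word3", text)       # also hits the prefix of "word3a"
with_boundary <- count_matches("\\bword3\\b", text)    # whole-word matches only

print(c(without_boundary = without_boundary, with_boundary = with_boundary))
```

On the sample tweet the unanchored pattern finds two matches and the anchored one finds a single match.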
Upvotes: 5
Reputation: 13304
I had pretty much the same idea about this problem as Daddy the Runner:
library(stringr)
term <- paste0("\\b", words$word, "\\b") # create a regex for every word; \\b works with both ICU (current stringr) and base R regex
# [1] "\\bword1\\b" "\\bword2\\b" "\\bword3\\b"
# count the occurrences of every word in every tweet (rows = words, columns = tweets)
m <- sapply(tweets$text, function(tweet) str_count(tweet, term), USE.NAMES = FALSE)
# [,1] [,2] [,3]
# [1,] 1 1 0
# [2,] 0 2 0
# [3,] 0 0 1
library(reshape)
df <- melt(m) # convert the result into the data frame format
# X1 X2 value
# 1 1 1 1
# 2 2 1 0
# 3 3 1 0
# 4 1 2 1
# 5 2 2 2
# 6 3 2 0
# 7 1 3 0
# 8 2 3 0
# 9 3 3 1
colnames(df) <- c('id.word','id.tweet','count') # melt() puts the row index (word) first, then the column index (tweet)
tmp <- with(df,data.frame(id=tweets$id[id.tweet],termfound=words$word[id.word],count=count)) # create a data frame similar to the one in the example
#   id termfound count
# 1  1     word1     1
# 2  1     word2     0
# 3  1     word3     0
# 4  2     word1     1
# 5  2     word2     2
# 6  2     word3     0
# 7  3     word1     0
# 8  3     word2     0
# 9  3     word3     1
Upvotes: 3
Reputation: 603
Observation: your code is counting each word three times. Once in the IF statement, once in the tmp assignment and once in the debug message. Reducing the number of calls to the string counting function will definitely improve the efficiency of your code.
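A minimal sketch of that observation, keeping the original loop structure but calling the counting function once per tweet/word pair and reusing the result. It uses base R's gregexpr as a stand-in counter so it runs without extra packages; count_word, terms and found are illustrative names, not part of the original code.

```r
tweets <- data.frame(id = c(1, 2, 3),
                     text = c("This is a tweet that contains word1",
                              "And here you can find word1 and word2 word2",
                              "And here is only one word3 and one word3a"),
                     stringsAsFactors = FALSE)
words <- data.frame(id = c(1, 2, 3), word = c("word1", "word2", "word3"),
                    stringsAsFactors = FALSE)

# Hypothetical helper: count whole-word matches of one pattern in one string.
count_word <- function(pattern, x) {
  lengths(regmatches(x, gregexpr(pattern, x, perl = TRUE)))
}

# Build every pattern once, outside the loops.
terms <- paste0("\\b", words$word, "\\b")

found <- list()
for (i in seq_len(nrow(tweets))) {
  for (j in seq_len(nrow(words))) {
    n <- count_word(terms[j], tweets$text[i])  # single call, result reused below
    if (n > 0) {
      found[[length(found) + 1]] <- data.frame(id = tweets$id[i],
                                               termfound = words$word[j],
                                               count = n)
    }
  }
}
found <- do.call(rbind, found)
print(found)
```

This only removes the redundant calls; the nested loop still visits every tweet/word pair, so the vectorized approaches remain the larger win.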
As mentioned above, the stringi package offers a faster set of string functions.
The following vectorized code will generate a 2-d matrix with the results you want which can then be transformed into the format needed for your database.
require(stringi)
tweets=data.frame(id=c(1,2,3),text=c("This is a tweet that contains word1",
"And here you can find word1 and word2 word2",
"And here is only one word3 and one word3a"),
stringsAsFactors = FALSE)
words=data.frame(id=c(1,2,3),word=c("word1","word2","word3"), stringsAsFactors = FALSE)
pat <- paste0("\\b", words$word, "\\b")
# count every pattern in a single tweet (vectorized over pat)
count_words <- function(text) { stri_count(text, regex=pat) }
# sapply() returns one column per tweet, so transpose to get one row per tweet
results <- t(sapply(tweets$text, count_words, USE.NAMES=FALSE))
colnames(results) <- words$word
rownames(results) <- paste("ID", tweets$id)
results
Which produces the following output:
##      word1 word2 word3
## ID 1     1     0     0
## ID 2     1     2     0
## ID 3     0     0     1
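To get from a wide count matrix to the long id/termfound/count layout used in the question, one option is to keep only the non-zero cells. A minimal base-R sketch, assuming a matrix with one row per tweet and one column per word; the small counts matrix is typed in by hand from the sample data, and hits and long are hypothetical names:

```r
# Assumed input: rows are tweets (named by id), columns are words.
counts <- matrix(c(1, 0, 0,
                   1, 2, 0,
                   0, 0, 1),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("1", "2", "3"),
                                 c("word1", "word2", "word3")))

# (row, col) index pairs of all non-zero cells.
hits <- which(counts > 0, arr.ind = TRUE)

# One row per tweet/word pair that actually occurs.
long <- data.frame(id = as.numeric(rownames(counts))[hits[, "row"]],
                   termfound = colnames(counts)[hits[, "col"]],
                   count = counts[hits],
                   row.names = NULL)
long <- long[order(long$id, long$termfound), ]
print(long)
```

With roughly 1M tweets and 25,000 words a dense matrix would have about 2.5e10 cells, so at that scale this would likely have to be done per chunk of tweets, writing each chunk's non-zero rows to the database before moving on.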
Upvotes: 9