Faster solution for extracting array of word-vectors from text string with R

I am looping over text to find and count specific words from various dictionaries. I use two nested for loops, which are extremely slow and take days to complete. Reproducible code below:

library(stringr)

#Sample data
tweets=data.frame(id=c(1,2,3),text=c("This is a tweet that contains word1",
                                     "And here you can find word1 and word2 word2",
                                     "And here is only one word3 and one word3a"))

words=data.frame(id=c(1,2,3),word=c("word1","word2","word3"))

for(i in 1:nrow(tweets)){
  for(j in 1:nrow(words)){
    term = paste("\\<",words[j,2],"\\>", sep="")
    if (str_count(tweets[i,2], term) != 0) {
     tmp <- data.frame(id=tweets[i,1],termfound=words[j,2],count=str_count(tweets[i,2], term), row.names=NULL)
     message("ID ",tweets[i,1]," - Word '",words[j,2],"' found ",str_count(tweets[i,2], term)," times")
     #sqlSave(myconn, tmp, "DataTable", append=T, rownames=F)
    }
  }
}

NOTES:
I have ~1M lines of text and ~25,000 words I am counting.
The Message line is just for debugging.
The final values are written to SQL - line commented out as it is not reproducible.

Is there any way to improve on this? I was thinking of an apply function.

Cheers B

Upvotes: 1

Answers (3)

Rich Scriven

Reputation: 99341

Update: You could also try stringi within a data.table.

library(data.table); library(stringi)

## convert tweets to  data table and set key on 'id' column 
dtweets <- as.data.table(tweets)
setkey(dtweets, id)

## convert words to data.table and set up the regex
dtw <- as.data.table(words)
dtw[,term := stri_c("\\b", word, "\\b")]

## run stri_count_regex on each tweet (grouped by id) against all terms
dtn <- dtweets[, stri_count_regex(text, dtw$term), by = key(dtweets)]
#    id V1
# 1:  1  1
# 2:  1  0
# 3:  1  0
# 4:  2  1
# 5:  2  2
# 6:  2  0
# 7:  3  0
# 8:  3  0
# 9:  3  1 

## melt the rows to columns
melted <- melt(dtn, id = 1L, measure = 2L)
dcast(melted, id ~ value, sum)
#   id 0 1 2
# 1  1 0 1 0
# 2  2 0 1 2
# 3  3 0 1 0
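If the goal is the long id / termfound / count layout from the question rather than a table keyed by count value, the same grouped call can carry the word labels along. This is only a sketch under the sample data; the names `terms` and `long` are made up for illustration, and zero counts are dropped to mirror the `if` in the original loop.

```r
library(data.table)
library(stringi)

# Sample data from the question, with text kept as character
tweets <- data.frame(id = c(1, 2, 3),
                     text = c("This is a tweet that contains word1",
                              "And here you can find word1 and word2 word2",
                              "And here is only one word3 and one word3a"),
                     stringsAsFactors = FALSE)
words <- data.frame(id = c(1, 2, 3),
                    word = c("word1", "word2", "word3"),
                    stringsAsFactors = FALSE)

dtweets <- as.data.table(tweets)
terms <- stri_c("\\b", words$word, "\\b")  # hypothetical helper name

# For each tweet id, count every term and keep the word next to its count
long <- dtweets[, .(termfound = words$word,
                    count = stri_count_regex(text, terms)),
                by = id]

# Keep only the words that were actually found
long <- long[count > 0]
long
```

This assumes each id appears once in `tweets`, so `text` is a single string inside each group.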

Original answer

Here's another method that counts the matches of every term in each tweet and binds the counts to the tweet ids. I had to use \\b for the word boundary in term.

library(stringi)

term <- stri_c("\\b", words$word, "\\b")

## vapply returns words in rows and tweets in columns, so transpose
out <- t(vapply(seq_along(tweets$text), function(i) {
        stri_count_regex(tweets$text[i], term)
    }, integer(length(term))))

cbind(tweets[1], `colnames<-`(out, words$word))
#   id word1 word2 word3
# 1  1     1     0     0
# 2  2     1     2     0
# 3  3     0     0     1

Upvotes: 5

Marat Talipov

Reputation: 13304

I had pretty much the same idea about this problem as Daddy the Runner:

term = paste("\\<",words$word,"\\>", sep="") # create a regex for every word
# [1] "\\<word1\\>" "\\<word2\\>" "\\<word3\\>"

library(stringr)
m <- sapply(tweets$text, function(tweet) str_count(tweet, term), USE.NAMES = FALSE) # count the occurrences of every word (rows) in every tweet (columns)
#      [,1] [,2] [,3]
# [1,]    1    1    0
# [2,]    0    2    0
# [3,]    0    0    1



library(reshape)
df <- melt(m) # convert the result into the data frame format
#   X1 X2 value
# 1  1  1     1
# 2  2  1     0
# 3  3  1     0
# 4  1  2     1
# 5  2  2     2
# 6  3  2     0
# 7  1  3     0
# 8  2  3     0
# 9  3  3     1

colnames(df) <- c('id.word','id.tweet','count') # melt() numbers the rows of m (words) as X1 and the columns (tweets) as X2

tmp <- with(df,data.frame(id=tweets$id[id.tweet],termfound=words$word[id.word],count=count)) # create a data frame similar to the one in the example
#   id termfound count
# 1  1     word1     1
# 2  1     word2     0
# 3  1     word3     0
# 4  2     word1     1
# 5  2     word2     2
# 6  2     word3     0
# 7  3     word1     0
# 8  3     word2     0
# 9  3     word3     1

Upvotes: 3

Daddy the Runner

Reputation: 603

Observation: your code counts each word three times: once in the if statement, once in the tmp assignment, and once in the debug message. Reducing the number of calls to the string counting function will improve the efficiency of your code.
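As a rough illustration of that point, the loop from the question could store the count in a variable and reuse it. This is a minimal sketch, not a tuned solution: the names `n`, `found` and `result` are invented here, the word boundary is written as \\b (an assumption, to match the other answers), and the rows are collected in a list instead of being written to SQL.

```r
library(stringr)

# Sample data from the question, with text kept as character
tweets <- data.frame(id = c(1, 2, 3),
                     text = c("This is a tweet that contains word1",
                              "And here you can find word1 and word2 word2",
                              "And here is only one word3 and one word3a"),
                     stringsAsFactors = FALSE)
words <- data.frame(id = c(1, 2, 3),
                    word = c("word1", "word2", "word3"),
                    stringsAsFactors = FALSE)

found <- list()  # hypothetical container standing in for the SQL write
for (i in seq_len(nrow(tweets))) {
  for (j in seq_len(nrow(words))) {
    term <- paste0("\\b", words$word[j], "\\b")
    n <- str_count(tweets$text[i], term)  # count once, reuse below
    if (n != 0) {
      found[[length(found) + 1]] <- data.frame(id = tweets$id[i],
                                               termfound = words$word[j],
                                               count = n)
      message("ID ", tweets$id[i], " - Word '", words$word[j],
              "' found ", n, " times")
    }
  }
}
result <- do.call(rbind, found)
result
```

This only removes the repeated calls; it is still a double loop, so the vectorized versions are likely to matter more at 1M tweets.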

As mentioned above, the stringi package offers a faster set of string functions.

The following vectorized code will generate a 2-d matrix with the results you want which can then be transformed into the format needed for your database.

library(stringi)
tweets=data.frame(id=c(1,2,3),text=c("This is a tweet that contains word1",
                                     "And here you can find word1 and word2 word2",
                                     "And here is only one word3 and one word3a"),
                  stringsAsFactors = FALSE)
words=data.frame(id=c(1,2,3),word=c("word1","word2","word3"), stringsAsFactors = FALSE)
pat <- paste("\\b",words$word,"\\b", sep="")
# count every pattern in one tweet; returns one count per word
count_words <- function(text) { stri_count(text, regex=pat) }
# sapply returns words in rows and tweets in columns, so transpose
results <- t(sapply(tweets$text, count_words, USE.NAMES=FALSE))
colnames(results) <- words$word
rownames(results) <- paste("ID", tweets$id)
results

Which produces the following output:

##      word1 word2 word3
## ID 1     1     0     0
## ID 2     1     2     0
## ID 3     0     0     1

Upvotes: 9
