pachadotdev

Reputation: 3765

How to count common words and store the result in a matrix?

I have a lot of text sentences that I want to compare; here's Little Red Riding Hood as an example:

text1 <- "Once upon a time"
text2 <- "there was a dear little girl"
text3 <- "who was loved by everyone who looked at her"

I want to create a matrix that counts the common words, like this:

text1_split <- unlist(strsplit(text1, " "))
text2_split <- unlist(strsplit(text2, " "))
text3_split <- unlist(strsplit(text3, " "))

length(intersect(text1_split, text2_split))
length(intersect(text2_split, text3_split))

texts <- c("text1","text2","text3")
data <- data.frame(texts)
data[, texts] <- NA
rownames(data) <- texts
data <- data[,-1]

data[1,1] <- length(intersect(text1_split, text1_split))
data[1,2] <- length(intersect(text1_split, text2_split))
data[1,3] <- length(intersect(text1_split, text3_split))

The resulting matrix so far looks like this:

      text1 text2 text3
text1     4     1     0
text2    NA    NA    NA
text3    NA    NA    NA

Is there a way to complete the matrix efficiently? I have more than 100 sentences to compare. This post is similar but not quite the same: Count common words in two strings in R

Upvotes: 1

Views: 260

Answers (2)

Joseph Wood

Reputation: 7597

Try this:

CommonWordsMatrixOld <- function(vList) {
    # lowercase every sentence first
    v <- lapply(vList, tolower)
    # for each sentence, count the words shared with every sentence (itself included)
    do.call(rbind, lapply(v, function(x) {
        xSplit <- strsplit(x, " ")[[1]]
        do.call(c, lapply(v, function(y) length(intersect(xSplit, strsplit(y, " ")[[1]]))))
    }))
}

myText <- list(text1, text2, text3)

Calling it we have:

CommonWordsMatrixOld(myText)
     [,1] [,2] [,3]
[1,]    4    1    0
[2,]    1    6    1
[3,]    0    1    8

And it is decently fast for data of the size the OP describes. The test data was obtained here:

testWords <- read.csv("4000-most-common-english-words-csv.csv", stringsAsFactors = FALSE)

set.seed(1111)
myTestText <- lapply(1:100, function(x) {
         paste(testWords[sample(1000:1020, sample(30, 1), replace = TRUE), ], collapse = " ")
    })

myTestText[[15]]
[1] "access restaurant video opinion video eventually fresh eventually
 reform credit publish judge Senate publish fresh restaurant publish
 version Senate critical release recall relation version"

system.time(test1 <- CommonWordsMatrixOld(myTestText))
 user  system elapsed 
0.625   0.009   0.646

Here is the output:

test1[1:10,1:10]
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
 [1,]    9    3    5    1    3    4    4    2    2     1
 [2,]    3    5    3    1    1    3    3    0    0     1
 [3,]    5    3   12    0    3    8    4    3    2     1
 [4,]    1    1    0    1    0    0    1    0    0     0
 [5,]    3    1    3    0    4    2    1    1    1     0
 [6,]    4    3    8    0    2   13    7    4    1     1
 [7,]    4    3    4    1    1    7   10    4    1     1
 [8,]    2    0    3    0    1    4    4    7    3     0
 [9,]    2    0    2    0    1    1    1    3    4     0
[10,]    1    1    1    0    0    1    1    0    0     2

Update

Here is a much faster algorithm that cuts out many unnecessary operations and takes advantage of lower.tri while remaining very general.

CommonWordsMatrixNew <- function(vList) {
    # split and lowercase each sentence only once
    v <- lapply(vList, function(x) tolower(strsplit(x, " ")[[1]]))
    s <- length(v)
    # compute only the upper triangle (plus the diagonal)
    m <- do.call(rbind, lapply(1L:s, function(x) {
        c(rep(0L, (x - 1L)), do.call(c, lapply(x:s, function(y) length(intersect(v[[x]], v[[y]])))))
    }))
    # mirror the upper triangle into the lower triangle
    m[lower.tri(m)] <- t(m)[lower.tri(m)]
    m
}

To give you an idea of the performance increase, here are some benchmarks. (It should be noted that the OP's solution works on pre-split vectors, so it does not pay the splitting cost and is not an entirely fair comparison.) The new algorithm is almost twice as fast as the OP's solution.
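For reference, the benchmark below also uses CommonWordsMatrixPach and PreSplit1, whose setup is not shown in the post. It presumably looks something like this (CommonWordsMatrixPach would be the other answer's CommonWordsMatrix function operating on pre-split input, and PreSplit1 the test text split into words ahead of time):

library(microbenchmark)

## assumed setup, not shown in the original post:
CommonWordsMatrixPach <- CommonWordsMatrix   # the other answer's function, renamed
PreSplit1 <- lapply(myTestText, function(x) strsplit(x, " ")[[1]])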

microbenchmark(New=CommonWordsMatrixNew(myTestText), 
               Old=CommonWordsMatrixOld(myTestText),
               Pach=CommonWordsMatrixPach(PreSplit1), times = 10)
Unit: milliseconds
 expr       min        lq      mean    median        uq      max neval
  New  78.64434  79.07127  86.10754  79.72828  81.39679 137.0695    10
  Old 321.49031 323.89835 326.61801 325.75221 328.50877 335.3306    10
 Pach 138.34742 143.00504 145.35147 145.17376 148.34699 151.5535    10

identical(CommonWordsMatrixNew(myTestText), CommonWordsMatrixOld(myTestText)) &&
    identical(CommonWordsMatrixNew(myTestText), CommonWordsMatrixPach(PreSplit1))
[1] TRUE

The new algorithm reduces the number of calls to strsplit by n^2 - n (e.g. in the example above, strsplit gets called 10000 times in the original algorithm and only 100 times in the updated version). Additionally, since the resulting matrix is symmetric, there is no need to compute the intersection between each pair of sentences more than once, hence the x = 1:s and y = x:s in the lapply calls. The number of computations for these loops drops from n^2 to the nth triangle number n*(n+1)/2 (e.g. in our example above, from 10000 to 5050). After that, we rely on the power of indexing in R, which is generally much faster than looping manually.
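To see the symmetric-fill step in isolation, here is a small standalone sketch (the numbers are taken from the text1/text2/text3 example above; this snippet is not part of the original answer):

## start from a matrix holding only the upper triangle and the diagonal
m <- matrix(c(4L, 1L, 0L,
              0L, 6L, 1L,
              0L, 0L, 8L), nrow = 3, byrow = TRUE)

## copy the upper triangle into the lower triangle by indexing
m[lower.tri(m)] <- t(m)[lower.tri(m)]
m
#      [,1] [,2] [,3]
# [1,]    4    1    0
# [2,]    1    6    1
# [3,]    0    1    8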

Upvotes: 1

pachadotdev

Reputation: 3765

I found that splitting the sentences beforehand increases speed, so that

CommonWordsMatrix <- function(vList) {
  # vList is a list of already-split word vectors
  v <- lapply(vList, tolower)
  # count the words shared between every pair of vectors
  do.call(rbind, lapply(v, function(x) {
    do.call(c, lapply(v, function(y) length(intersect(x, y))))
  }))
}

is a good option (here x and y are pre-split vectors of words).
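For example, with the question's sentences the pre-splitting could look like this (a small sketch; split_texts is just an illustrative name):

## split each sentence into a vector of words ahead of time
split_texts <- lapply(list(text1, text2, text3), function(x) strsplit(x, " ")[[1]])
CommonWordsMatrix(split_texts)
#      [,1] [,2] [,3]
# [1,]    4    1    0
# [2,]    1    6    1
# [3,]    0    1    8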

Upvotes: 0
