Reputation: 3765
I have a lot of text sentences that I want to compare, but here's Little Red Riding Hood as an example:
text1 <- "Once upon a time"
text2 <- "there was a dear little girl"
text3 <- "who was loved by everyone who looked at her"
I want to create a matrix that counts the common words, just like this:
text1_split <- unlist(strsplit(text1, " "))
text2_split <- unlist(strsplit(text2, " "))
text3_split <- unlist(strsplit(text3, " "))
length(intersect(text1_split, text2_split))
length(intersect(text2_split, text3_split))
texts <- c("text1","text2","text3")
data <- data.frame(texts)
data[, texts] <- NA
rownames(data) <- texts
data <- data[,-1]
data[1,1] <- length(intersect(text1_split, text1_split))
data[1,2] <- length(intersect(text1_split, text2_split))
data[1,3] <- length(intersect(text1_split, text3_split))
The result of my matrix is this:
      text1 text2 text3
text1     4     1     0
text2    NA    NA    NA
text3    NA    NA    NA
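Continuing by hand would mean one assignment per cell for every remaining row, which clearly doesn't scale:
data[2,1] <- length(intersect(text2_split, text1_split))
data[2,2] <- length(intersect(text2_split, text2_split))
data[2,3] <- length(intersect(text2_split, text3_split))
# ... and the same again for row 3, and for every additional sentence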
Is there a way to complete the matrix efficiently? I have more than 100 sentences to compare. This post is similar, but not quite the same: Count common words in two strings in R
Upvotes: 1
Views: 260
Reputation: 7597
Try this:
CommonWordsMatrixOld <- function(vList) {
    v <- lapply(vList, tolower)  # lower-case every sentence
    do.call(rbind, lapply(v, function(x) {
        xSplit <- strsplit(x, " ")[[1]]  # split the row sentence into words
        # count the words shared with every sentence (including itself)
        do.call(c, lapply(v, function(y) length(intersect(xSplit, strsplit(y, " ")[[1]]))))
    }))
}
myText <- list(text1, text2, text3)
Calling it we have:
CommonWordsMatrixOld(myText)
     [,1] [,2] [,3]
[1,]    4    1    0
[2,]    1    6    1
[3,]    0    1    8
And it is decently fast for data of the size the OP is requesting. The test data was obtained here:
testWords <- read.csv("4000-most-common-english-words-csv.csv", stringsAsFactors = FALSE)
set.seed(1111)
myTestText <- lapply(1:100, function(x) {
    paste(testWords[sample(1000:1020, sample(30, 1), replace = TRUE), ], collapse = " ")
})
myTestText[[15]]
[1] "access restaurant video opinion video eventually fresh eventually
reform credit publish judge Senate publish fresh restaurant publish
version Senate critical release recall relation version"
system.time(test1 <- CommonWordsMatrixOld(myTestText))
   user  system elapsed
  0.625   0.009   0.646
Here is the output:
test1[1:10,1:10]
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
 [1,]    9    3    5    1    3    4    4    2    2     1
 [2,]    3    5    3    1    1    3    3    0    0     1
 [3,]    5    3   12    0    3    8    4    3    2     1
 [4,]    1    1    0    1    0    0    1    0    0     0
 [5,]    3    1    3    0    4    2    1    1    1     0
 [6,]    4    3    8    0    2   13    7    4    1     1
 [7,]    4    3    4    1    1    7   10    4    1     1
 [8,]    2    0    3    0    1    4    4    7    3     0
 [9,]    2    0    2    0    1    1    1    3    4     0
[10,]    1    1    1    0    0    1    1    0    0     2
Update
Here is a much faster algorithm that cuts out many unnecessary operations and takes advantage of lower.tri while remaining very general.
CommonWordsMatrixNew <- function(vList) {
    # split and lower-case each sentence exactly once
    v <- lapply(vList, function(x) tolower(strsplit(x, " ")[[1]]))
    s <- length(v)
    # compute only the upper triangle (y >= x); left-pad each row with zeros
    m <- do.call(rbind, lapply(1L:s, function(x) {
        c(rep(0L, x - 1L), do.call(c, lapply(x:s, function(y) length(intersect(v[[x]], v[[y]])))))
    }))
    # mirror the upper triangle into the lower triangle
    m[lower.tri(m)] <- t(m)[lower.tri(m)]
    m
}
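As a quick sanity check, calling it on the small myText list from above returns the same matrix as CommonWordsMatrixOld:
CommonWordsMatrixNew(myText)
     [,1] [,2] [,3]
[1,]    4    1    0
[2,]    1    6    1
[3,]    0    1    8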
To give you an idea of the performance increase, here are some benchmarks. (It should be noted that the OP's solution isn't splitting the vectors, so it is not a true comparison.) The new algorithm is almost twice as fast as the OP's solution.
microbenchmark(New = CommonWordsMatrixNew(myTestText),
               Old = CommonWordsMatrixOld(myTestText),
               Pach = CommonWordsMatrixPach(PreSplit1), times = 10)
Unit: milliseconds
 expr       min        lq      mean    median        uq      max neval
  New  78.64434  79.07127  86.10754  79.72828  81.39679 137.0695    10
  Old 321.49031 323.89835 326.61801 325.75221 328.50877 335.3306    10
 Pach 138.34742 143.00504 145.35147 145.17376 148.34699 151.5535    10
identical(CommonWordsMatrixNew(myTestText), CommonWordsMatrixOld(myTestText)) &&
    identical(CommonWordsMatrixOld(myTestText), CommonWordsMatrixPach(PreSplit1))
[1] TRUE
The new algorithm reduces the number of calls to strsplit by n^2 - n (e.g. in the example above, strsplit gets called 10000 times in the original algorithm and only 100 times in the updated version). Additionally, since the resulting matrix is symmetric, there is no need to compute the intersection between each pair of sentences more than once, hence the x = 1:s and y = x:s in the lapply calls. The number of computations for these loops drops from n^2 to the nth triangular number, n*(n+1)/2 (e.g. in our example above, from 10000 to 5050). After that, we rely on the power of indexing in R, which is generally much faster than building the matrix up manually.
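To illustrate the indexing trick in isolation, here is a minimal sketch with a made-up 3x3 example (the values 1:6 simply stand in for computed counts):
m <- matrix(0L, 3, 3)
m[upper.tri(m, diag = TRUE)] <- 1:6    # pretend these are the computed counts
m[lower.tri(m)] <- t(m)[lower.tri(m)]  # mirror the upper triangle downward
m
     [,1] [,2] [,3]
[1,]    1    2    4
[2,]    2    3    5
[3,]    4    5    6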
Upvotes: 1
Reputation: 3765
I found that splitting beforehand increases speed, so that
CommonWordsMatrix <- function(vList) {
    v <- lapply(vList, tolower)  # elements are already word vectors, so no strsplit here
    do.call(rbind, lapply(v, function(x) {
        do.call(c, lapply(v, function(y) length(intersect(x, y))))
    }))
}
is a nice way to go (x and y are pre-split vectors of words).
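For completeness, a minimal sketch of the pre-splitting step (PreSplit is just an illustrative name here; the benchmark above used PreSplit1):
PreSplit <- lapply(myTestText, function(x) strsplit(x, " ")[[1]])
test2 <- CommonWordsMatrix(PreSplit)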
Upvotes: 0