Reputation: 245
I need to build a similarity matrix by comparing the terms of documents. For example, if Document1 and Document2 have 2 terms in common, I need to write a 2 in my similarity matrix at m[1, 2]. My similarity matrix currently looks like this:
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
[1,] 0 NA NA NA NA NA NA NA NA
[2,] 0 0 NA NA NA NA NA NA NA
[3,] 0 0 0 NA NA NA NA NA NA
[4,] 0 0 0 0 NA NA NA NA NA
[5,] 0 0 0 0 0 NA NA NA NA
[6,] 0 0 0 0 0 0 NA NA NA
[7,] 0 0 0 0 0 0 0 NA NA
[8,] 0 0 0 0 0 0 0 0 NA
The documents and terms are inside a document-term matrix. Now I have to fill the similarity matrix, wherever it currently says NA, by comparing all documents and their terms. For every term matched in a document pair I have to add 1 to a counter and write the final value to the right place in the matrix.
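To make the counting rule concrete, here is a minimal sketch for a single pair of documents. It uses only base R on two of the raw sample strings, so it skips the tm preprocessing entirely; the names `doc1`, `doc2`, `shared` and `similarity` are just illustrative.

```r
# Two of the sample documents, split into terms (base R only, no tm needed).
doc1 <- strsplit("count eagle alien", " ")[[1]]
doc2 <- strsplit("dis bound eagle", " ")[[1]]

# Terms that occur in both documents; the number of them is the similarity value.
shared <- intersect(doc1, doc2)
similarity <- length(shared)

print(shared)      # "eagle"
print(similarity)  # 1
```

So for this pair the value to store would be 1.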
My problem is that I can't seem to access the individual documents and their terms inside the document-term matrix. Is there another way to do this, or am I missing something?
Here is the code:
install.packages("tm")
install.packages("openNLP")
install.packages("openNLPmodels.en")
Sys.setenv(NOAWT=TRUE)
library(tm)
library(openNLP)
library(openNLPmodels.en)
sample = c(
  "count eagle alien",
  "dis bound eagle",
  "bound count eagle dis",
  "count eagle dis alien",
  "bound eagle",
  "count dis alien",
  "bound count alien",
  "bound count",
  "count eagle dis"
)
print(sample)
corpus <- Corpus(VectorSource(sample))
inspect(corpus)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, tolower)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stemDocument,language="english")
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, tmTagPOS)
inspect(corpus)
dtm <- DocumentTermMatrix(corpus)
inspect(dtm)
# need to create similarity matrix here
#dist(dtm, method = "manhattan", diag = FALSE, upper = TRUE)
rowCount <- nrow(dtm)
similMatrix = matrix(nrow = rowCount - 1, ncol = rowCount)
show(similMatrix)
similMatrix[ row(similMatrix) >= col(similMatrix) ] <- 0
for (i in 1:(rowCount - 1)) {    # rows
  for (j in (i + 1):rowCount) {  # cols; parenthesized because i+1:rowCount parses as i + (1:rowCount)
    # need to compare document i and j here and write
    # the value into similarity matrix
  }
}
show(similMatrix)
Upvotes: 1
Views: 4059
Reputation: 552
I think your similarity matrix is missing one row, because your last document isn't covered. Mine looks like this:
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
[1,] NA NA NA NA NA NA NA NA NA
[2,] 1 NA NA NA NA NA NA NA NA
[3,] 2 3 NA NA NA NA NA NA NA
[4,] 3 2 3 NA NA NA NA NA NA
[5,] 1 2 2 1 NA NA NA NA NA
[6,] 2 1 2 3 0 NA NA NA NA
[7,] 2 1 2 2 1 2 NA NA NA
[8,] 1 1 2 1 1 1 2 NA NA
[9,] 2 2 3 3 1 2 1 1 NA
To get this result I did the following:
mat <- as.data.frame(as.matrix(dtm))  # convert the DocumentTermMatrix to a data frame
rowCount <- nrow(dtm)
colCount <- ncol(dtm)
similMatrix <- matrix(nrow = rowCount, ncol = rowCount)
similMatrix[row(similMatrix) >= col(similMatrix)] <- 0
for (i in 1:rowCount) {  # set the diagonal to NA; you can change it to zeros later if you need
  similMatrix[i, i] <- NA
}
# then we do the actual job
for (i in 1:rowCount) {    # rows
  for (j in 1:rowCount) {  # cols
    if (!is.na(similMatrix[i, j])) {  # only the lower triangle is filled
      a <- mat[i, ]
      b <- mat[j, ]
      for (k in 1:colCount) {  # loop over the terms (columns of the document-term matrix)
        if (a[[k]] > 0 && b[[k]] > 0) {  # term k occurs in both documents
          similMatrix[i, j] <- similMatrix[i, j] + 1
        }
      }
    }
  }
}
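If the loops get slow on a larger corpus, the same counts can probably be obtained with one matrix product. This is only a sketch: the small matrix `m` below is built by hand as a stand-in for `as.matrix(dtm)` (first three sample documents, unstemmed) so that it runs without tm, and the names `docs`, `terms`, `present` and `simil` are mine.

```r
# Stand-in for as.matrix(dtm): one row per document, one column per term.
# With the real data you would use m <- as.matrix(dtm) instead.
docs  <- c("count eagle alien", "dis bound eagle", "bound count eagle dis")
terms <- sort(unique(unlist(strsplit(docs, " "))))
m <- t(vapply(strsplit(docs, " "),
              function(d) as.integer(terms %in% d),
              integer(length(terms))))
colnames(m) <- terms

# 1 where a term occurs in a document, 0 otherwise.
present <- (m > 0) * 1

# present %*% t(present): entry [i, j] is the number of terms shared by documents i and j.
simil <- tcrossprod(present)

# Keep only the lower triangle, matching the layout of the result shown above.
simil[upper.tri(simil, diag = TRUE)] <- NA
print(simil)
```

For these three documents the lower triangle comes out as 1, 2 and 3, which agrees with the first three rows of the matrix above.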
Upvotes: 3