Reputation: 189
I have a DocumentTermMatrix and I'd like to replace specific terms in it and then create a frequency table.
The starting point is the original document as follows:
library(tm)
library(qdap)
df1 <- data.frame(word =c("test", "test", "teste", "hey", "heyyy", "hi"))
tdm <- as.DocumentTermMatrix(as.character(df1$word))
When I create a frequency table of the original document I get the correct results:
freq0 <- as.matrix(sort(colSums(as.matrix(tdm)), decreasing=TRUE))
freq0
So far so good. However, if I replace some terms in the document, the new frequency table gives wrong results:
tdm$dimnames$Terms <- mgsub(c("teste", "heyyy"), c("test", "hey"), as.character(tdm$dimnames$Terms), fixed=T, trim=T)
freq1 <- as.matrix(sort(colSums(as.matrix(tdm)), decreasing=TRUE))
freq1
Apparently some indexing in the document is wrong, because even identical terms are not treated as the same term when they are counted.
This outcome should be the ideal case:
df2 <- data.frame(word =c("test", "test", "test", "hey", "hey", "hi"))
tdm2 <- as.DocumentTermMatrix(as.character(df2$word))
tdm2$dimnames$Terms <- mgsub(c("teste", "heyyy"), c("test", "hey"), as.character(tdm2$dimnames$Terms), fixed=T, trim=T)
freq2 <- as.matrix(sort(colSums(as.matrix(tdm2)), decreasing=TRUE))
freq2
Can anyone help me figure out the problem?
Thanks in advance.
Upvotes: 2
Views: 249
Reputation: 887501
We can look at the structure of as.matrix(tdm):
str(as.matrix(tdm))
#num [1, 1:5] 1 1 1 2 1
# - attr(*, "dimnames")=List of 2
# ..$ Docs : chr "all"
# ..$ Terms: chr [1:5] "hey" "heyyy" "hi" "test" ...
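This is a matrix with one row and 5 columns, so colSums just returns each column's value unchanged. It does not combine columns that share a name. Summing the counts by term name, for example with xtabs, does that grouping: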
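To see why colSums cannot merge the renamed terms, here is a minimal base R sketch. The matrix `m` is a hypothetical stand-in for as.matrix(tdm) after the mgsub step, built by hand from the structure shown above, so it runs without tm or qdap.

```r
# Hypothetical stand-in for as.matrix(tdm) after renaming:
# one document row, five term columns, two pairs of duplicated names.
m <- matrix(c(1, 1, 1, 2, 1), nrow = 1,
            dimnames = list(Docs = "all",
                            Terms = c("hey", "hey", "hi", "test", "test")))

# colSums works per column position, so duplicated names stay separate.
cs <- colSums(m)
n_cols <- length(cs)

# Grouping by name is what actually merges the duplicates.
merged <- tapply(cs, names(cs), FUN = sum)
print(merged)
```

With a single row, `cs` holds the same five values as the row itself. Only the grouped sum collapses them to one entry per term.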
xtabs(as.vector(tdm)~tdm$dimnames$Terms)
#tdm$dimnames$Terms
# hey heyyy hi test teste
# 1 1 1 2 1
and after replacing the terms with mgsub:
xtabs(as.vector(tdm)~tdm$dimnames$Terms)
#tdm$dimnames$Terms
# hey hi test
# 2 1 3
Here xtabs sums the counts in the vector for each term. This can also be done with tapply:
tapply(as.vector(tdm), tdm$dimnames$Terms, FUN = sum)
If the number of rows is greater than 1, we can apply colSums first and then sum by term:
tapply(colSums(as.matrix(tdm)), tdm$dimnames$Terms, FUN = sum)
# hey hi test
# 4 2 6
NOTE: The above output is after we made the changes with mgsub
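As a rough sketch of the multi-row case, the following base R example combines colSums with tapply and then builds the sorted frequency table the question asks for. The matrix `m2` and its counts are made up for illustration and are not the data that produced the output above.

```r
# Hypothetical two-document matrix with already-renamed (duplicated) terms.
m2 <- matrix(c(1, 1, 1, 2, 1,
               0, 1, 1, 1, 2), nrow = 2, byrow = TRUE,
             dimnames = list(Docs = c("doc1", "doc2"),
                             Terms = c("hey", "hey", "hi", "test", "test")))

# Sum over documents first, then merge columns that share a term name.
totals <- tapply(colSums(m2), colnames(m2), FUN = sum)

# Sorted one-column frequency table, in the same shape as freq0 in the question.
freq <- as.matrix(sort(totals, decreasing = TRUE))
print(freq)
```

The same two steps should apply to the real data by replacing `m2` with as.matrix(tdm) once the term names have been changed.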
Upvotes: 2