Reputation: 23
I am fairly new to R, and this is my first time using the tm package for text mining. After creating a corpus and frequency matrix with my text vector, I noticed that some of the words have disappeared. Here is my code:
setwd("C:/Users/M/Dropbox/Research")
library(tm)
data=read.table("abstract.tex.clean",stringsAsFactors=FALSE)
data=unlist(data,use.names=FALSE)
stopwords=read.table("stopwords.txt",stringsAsFactors=FALSE)
stopwords=unlist(stopwords,use.names=FALSE)
i=1
while(i<=length(stopwords)){data=data[data != stopwords[i]];i=i+1}
x = VCorpus(VectorSource(data))
dtm = DocumentTermMatrix(x)
dtm2 = as.matrix(dtm)
frequency = colSums(dtm2)
frequency = sort(frequency, decreasing=TRUE)
After I run this, the command
frequency["tas"]
and
length(which(data=="tas"))
yield the same frequency result (35).
However,
frequency["ta"]
returns NA
where
length(which(data=="ta"))
is 77.
Help would be appreciated as to why these terms disappeared!
Upvotes: 0
Views: 125
Reputation: 206167
By default, DocumentTermMatrix() only tracks words with at least three characters, so two-letter terms such as "ta" are dropped. You can change the minimum and maximum word lengths via the wordLengths entry of the control= parameter.
words<-c("tas","ta","pas","pa")
Terms(DocumentTermMatrix(VCorpus(VectorSource(words))))
# [1] "pas" "tas"
Terms(DocumentTermMatrix(VCorpus(VectorSource(words)), control=list(wordLengths=c(1,Inf))))
# [1] "pa" "pas" "ta" "tas"
For more information, I suggest reading the ?DocumentTermMatrix help page; the control options, including wordLengths, are described in ?termFreq.
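Applied to the pipeline in the question, the change might look like the sketch below. The tokens vector is a made-up stand-in for the asker's data (one word per element), so the counts here are illustrative only and not the real 77 and 35.

```r
library(tm)

# Hypothetical stand-in for the question's `data` vector: one token per element.
tokens <- c("ta", "tas", "ta", "pa", "tas", "ta")

corpus <- VCorpus(VectorSource(tokens))

# wordLengths = c(1, Inf) keeps terms of any length
# instead of the default minimum of three characters.
dtm <- DocumentTermMatrix(corpus, control = list(wordLengths = c(1, Inf)))

# Same frequency computation as in the question.
frequency <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)

frequency["ta"]   # 3, matches length(which(tokens == "ta"))
frequency["tas"]  # 2, matches length(which(tokens == "tas"))
```

With the real data, frequency["ta"] should then agree with length(which(data=="ta")) in the same way frequency["tas"] already does.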
Upvotes: 1