user3900757

Reputation: 23

tm Package R - Terms Being Removed When Creating Corpus for Text Mining

I am fairly new to R, and this is my first time using the tm package for text mining. After creating a corpus and frequency matrix with my text vector, I noticed that some of the words have disappeared. Here is my code:

setwd("C:/Users/M/Dropbox/Research")
library(tm)
data=read.table("abstract.tex.clean",stringsAsFactors=FALSE) 
data=unlist(data,use.names=FALSE) 
stopwords=read.table("stopwords.txt",stringsAsFactors=FALSE) 
stopwords=unlist(stopwords,use.names=FALSE) 
# drop every token that appears in the stop word list
data=data[!(data %in% stopwords)]

x = VCorpus(VectorSource(data))
dtm = DocumentTermMatrix(x)
dtm2 = as.matrix(dtm)
frequency = colSums(dtm2)
frequency = sort(frequency, decreasing=TRUE)

After I run this, the command

frequency["tas"]

and

length(which(data=="tas"))

yield the same frequency result (35).

However,

frequency["ta"]

returns NA,

whereas

length(which(data=="ta"))

is 77.

Help would be appreciated as to why these terms disappeared!

Upvotes: 0

Views: 125

Answers (1)

MrFlick

Reputation: 206167

By default, DocumentTermMatrix() only tracks words with at least three characters, so shorter terms such as "ta" are silently dropped. You can change the minimum and maximum word lengths with the wordLengths option of the control= parameter.

words<-c("tas","ta","pas","pa")
Terms(DocumentTermMatrix(VCorpus(VectorSource(words))))
# [1] "pas" "tas"
Terms(DocumentTermMatrix(VCorpus(VectorSource(words)), control=list(wordLengths=c(1,Inf))))
# [1] "pa"  "pas" "ta"  "tas"

For more information, I suggest reading the ?DocumentTermMatrix help page, along with ?termFreq, which documents the available control options (including wordLengths).
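As a rough sketch of how this might apply to the code in the question: the `tokens` vector below is made-up stand-in data (not the asker's file), with one word per element as in the original `data` vector. Only the `control=` argument differs from the question's pipeline.

```r
library(tm)

# Hypothetical stand-in for the asker's `data` vector: one token per element.
tokens <- c(rep("tas", 3), rep("ta", 5), "pas", "pa")

corpus <- VCorpus(VectorSource(tokens))

# wordLengths = c(1, Inf) keeps terms of any length;
# the default lower bound of 3 is what drops "ta" and "pa".
dtm <- DocumentTermMatrix(corpus, control = list(wordLengths = c(1, Inf)))

# Same frequency table as in the question, now including short terms.
frequency <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)

frequency["ta"]
frequency["tas"]
```

With this setting, frequency["ta"] should match length(which(tokens=="ta")) instead of returning NA.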

Upvotes: 1
