Reputation: 49
I am trying to build a term-document matrix from the text of a single PDF. When I inspect the term-document matrix, I get this:
<<TermDocumentMatrix (terms: 7245, documents:342)>>
The number of documents should be 1, not 342; 342 is the number of pages in the PDF file. This is the R code I tried:
library(pdftools)  # provides pdf_text()
library(tm)        # provides Corpus(), VectorSource(), TermDocumentMatrix()

pdf_file <- file.path("Lat/web", "textpdf.pdf")
text <- pdf_text(pdf_file)
myCorpus <- Corpus(VectorSource(text))
mytdm <- TermDocumentMatrix(myCorpus, control = list(
  removeNumbers = TRUE,
  removePunctuation = TRUE,
  stopwords = stopwords_en,
  stemming = TRUE
))
inspect(mytdm)
Upvotes: 0
Views: 302
Reputation: 23598
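pdf_text() returns a character vector with one element per page, and VectorSource() treats each element as a separate document, which is why you get 342 documents. Use the following code to collapse the PDF pages into one document before building the corpus.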
pdf_file <- file.path("Lat/web", "textpdf.pdf")
text <- pdf_text(pdf_file)
# collapse the PDF pages into a single string;
# the space keeps words at page boundaries from running together
text <- paste(text, collapse = " ")
.....
rest of code
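If you want to check the effect of the collapse without opening the PDF, here is a minimal base-R sketch. The `pages` vector is a made-up stand-in for what pdf_text() would return (one string per page), not your actual data, and `doc` is just an illustrative name.

```r
# Hypothetical stand-in for the output of pdf_text(): one string per page.
pages <- c("first page text\n", "second page text\n", "third page text\n")
length(pages)  # 3: one element per page, so three "documents"

# Collapse every page into a single string so it counts as one document.
doc <- paste(pages, collapse = " ")
length(doc)    # 1: a single element, so a single document
```

Passing a length-1 vector like `doc` to VectorSource() should then give a corpus with one document instead of one per page.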
Upvotes: 0