Hilfit19

Reputation: 49

Build a term-document matrix from a PDF file

I am trying to build a term-document matrix from the text of a single PDF. When I inspect the term-document matrix, I get this:

<<TermDocumentMatrix (terms: 7245, documents: 342)>>

The number of documents should be 1, not 342; 342 is the number of pages in the PDF file. This is the R code I used:

library(pdftools)  # provides pdf_text()
library(tm)        # provides Corpus(), VectorSource(), TermDocumentMatrix()

pdf_file <- file.path("Lat/web", "textpdf.pdf")
text <- pdf_text(pdf_file)
myCorpus <- Corpus(VectorSource(text))

mytdm <- TermDocumentMatrix(myCorpus, control = list(
  removeNumbers = TRUE,
  removePunctuation = TRUE,
  stopwords = stopwords_en,
  stemming = TRUE
))
inspect(mytdm)

Upvotes: 0

Views: 302

Answers (1)

phiver

Reputation: 23598

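pdf_text() returns a character vector with one element per page, and VectorSource() treats each element as a separate document. Use the following code to collapse the PDF pages into a single document before building the corpus.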

pdf_file <- file.path("Lat/web", "textpdf.pdf")
text <- pdf_text(pdf_file)
# collapse the PDF pages into a single string;
# the space separator keeps words at page boundaries from merging
text <- paste(text, collapse = " ")
# ... rest of the code
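
To see the effect without the PDF at hand, here is a minimal sketch. It assumes the tm package is installed; the `pages` vector is a made-up stand-in for what pdf_text() returns (one string per page), and the names `tdm_pages` and `tdm_whole` are only for illustration.

```r
library(tm)

# Hypothetical stand-in for pdf_text() output: one string per page
pages <- c(
  "First page text about cats.\n",
  "Second page text about dogs.\n",
  "Third page text about cats and dogs.\n"
)

# One corpus document per page, as in the question
tdm_pages <- TermDocumentMatrix(Corpus(VectorSource(pages)))

# Collapse the pages into one string first, then build the corpus
whole <- paste(pages, collapse = " ")
tdm_whole <- TermDocumentMatrix(Corpus(VectorSource(whole)))

# Compare the number of documents in each matrix
print(nDocs(tdm_pages))
print(nDocs(tdm_whole))
```

The first matrix should report one document per element of `pages`, and the second a single document. The same control list from the question (removeNumbers, stemming, and so on) can be passed to TermDocumentMatrix() in either case.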

Upvotes: 0
