Colombe

Reputation: 1

R: import PDFs and create a TermDocumentMatrix with file names as document IDs

I'm importing PDF files into R to do some text analysis. Each file is named after its publication year (one publication per year).

After importing them, I would like to create a TermDocumentMatrix in which the "Docs" dimension (i.e. the document labels of the tdm) shows the publication year rather than the document's position. At the moment the tdm labels the documents with numbers (1, 2, 3, etc.) when I create it.

Any ideas on how to do it? My code is below.

Thanks!

library(pdftools)  # provides pdf_text()
library(tm)        # provides Corpus(), VectorSource(), TermDocumentMatrix()

# list the PDF files to pick up (from the working directory)
files <- list.files(pattern = "pdf$")

# read the PDF files from the list; pdf_text() returns one string per page
new_files <- sapply(files, pdf_text)

# create the corpus
new_corp <- Corpus(VectorSource(new_files))

IMF_tdm <- TermDocumentMatrix(new_corp, control = list(removePunctuation = TRUE,
                                                         stopwords = TRUE,
                                                         tolower = TRUE,
                                                         stemming = TRUE,
                                                         removeNumbers = TRUE,
                                                         bounds = list(global = c(2, Inf))))

Upvotes: 0

Views: 413

Answers (1)

DuckPyjamas

Reputation: 1659

Try readtext (https://cran.r-project.org/web/packages/readtext/vignettes/readtext_vignette.html). I have used it in the past to read plain-text and CSV files, and it can also convert and import PDFs. It returns a data frame with the document file name in the first column (doc_id) and the entire document's text as a single string in the second column (text).

Here's the vignette example using some of the data files distributed with the readtext library:

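library(readtext)

## Location of the example data shipped with readtext
DATA_DIR <- system.file("extdata/", package = "readtext")

## Read in Universal Declaration of Human Rights pdf files

(rt_pdf <- readtext(paste0(DATA_DIR, "/pdf/UDHR/*.pdf"), 
                    docvarsfrom = "filenames", 
                    docvarnames = c("document", "language"),
                    sep = "_"))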

## readtext object consisting of 11 documents and 2 docvars.
## # data.frame [11 × 4]
##   doc_id           text                          document language
##   <chr>            <chr>                         <chr>    <chr>   
## 1 UDHR_chinese.pdf "\"世界人权宣言\n联合国\"..." UDHR     chinese 
## 2 UDHR_czech.pdf   "\"VŠEOBECNÁ \"..."           UDHR     czech   
## 3 UDHR_danish.pdf  "\"Den 10. de\"..."           UDHR     danish  
## 4 UDHR_english.pdf "\"Universal \"..."           UDHR     english 
## 5 UDHR_french.pdf  "\"Déclaratio\"..."           UDHR     french  
## 6 UDHR_greek.pdf   "\"ΟΙΚΟΥΜΕΝΙΚ\"..."           UDHR     greek   
## # ... with 5 more rows
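
The same doc_id/text layout can probably be fed straight into tm, which would give the question's tdm year labels. The following is only a sketch under a few assumptions: the files are named like "2001.pdf" in the working directory, the installed tm version supports DataframeSource() with "doc_id" and "text" columns, and SnowballC is available for stemming. The helper year_from_file() is a hypothetical name introduced for illustration, not part of any package.

```r
library(pdftools)  # pdf_text()
library(tm)        # DataframeSource(), VCorpus(), TermDocumentMatrix()

# Hypothetical helper: strip directory and extension, e.g. "2001.pdf" -> "2001"
year_from_file <- function(path) tools::file_path_sans_ext(basename(path))

files <- list.files(pattern = "pdf$")

# pdf_text() returns one string per page, so collapse the pages
# into a single string per file
texts <- vapply(files,
                function(f) paste(pdf_text(f), collapse = "\n"),
                character(1),
                USE.NAMES = FALSE)

# tm's DataframeSource() expects the columns "doc_id" and "text"
docs <- data.frame(doc_id = year_from_file(files),
                   text = texts,
                   stringsAsFactors = FALSE)

new_corp <- VCorpus(DataframeSource(docs))

IMF_tdm <- TermDocumentMatrix(new_corp, control = list(removePunctuation = TRUE,
                                                         stopwords = TRUE,
                                                         tolower = TRUE,
                                                         stemming = TRUE,
                                                         removeNumbers = TRUE,
                                                         bounds = list(global = c(2, Inf))))

# The document labels should now be the years taken from the file names
colnames(IMF_tdm)
```

If you would rather keep the original corpus code, assigning the labels after the fact with colnames(IMF_tdm) <- year_from_file(files) may also work, provided the documents are still in the same order as files.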

Upvotes: 1
