Reputation: 1
I'm importing PDF files into R in order to do some text analysis. Each file is named after its publication year (one publication per year).
After importing them, I would like to create a TermDocumentMatrix whose "Docs" dimension (i.e. the column labels of the TDM) shows the publication year rather than the number of the document. At the moment the TDM labels the documents 1, 2, 3, etc. when I create it.
Any ideas on how to do this? My code is below.
Thanks!
library(pdftools)  # pdf_text()
library(tm)        # Corpus(), VectorSource(), TermDocumentMatrix()

# list the PDF files to be picked up (from the working directory)
files <- list.files(pattern = "pdf$")
# read the text of each PDF file (pdf_text() returns one string per page)
new_files <- sapply(files, pdf_text)
# create the corpus
new_corp <- Corpus(VectorSource(new_files))
# build the term-document matrix
IMF_tdm <- TermDocumentMatrix(new_corp, control = list(removePunctuation = TRUE,
                                                       stopwords = TRUE,
                                                       tolower = TRUE,
                                                       stemming = TRUE,
                                                       removeNumbers = TRUE,
                                                       bounds = list(global = c(2, Inf))))
Upvotes: 0
Views: 413
Reputation: 1659
Try readtext:
https://cran.r-project.org/web/packages/readtext/vignettes/readtext_vignette.html
I have used it in the past to read in plaintext and CSV files, and it can also convert and import PDFs. It outputs a data frame with the document's filename in one column and the entire document's text as a single string in the second column.
Here's the vignette example, which uses some of the data files distributed with the readtext package:
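library(readtext)
# DATA_DIR points at the example files shipped with readtext (as set up in the vignette)
DATA_DIR <- system.file("extdata/", package = "readtext")

## Read in Universal Declaration of Human Rights pdf files
(rt_pdf <- readtext(paste0(DATA_DIR, "/pdf/UDHR/*.pdf"),
                    docvarsfrom = "filenames",
                    docvarnames = c("document", "language"),
                    sep = "_"))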
## readtext object consisting of 11 documents and 2 docvars.
## # data.frame [11 × 4]
##   doc_id           text                          document language
##   <chr>            <chr>                         <chr>    <chr>
## 1 UDHR_chinese.pdf "\"世界人权宣言\n联合国\"..." UDHR     chinese
## 2 UDHR_czech.pdf   "\"VŠEOBECNÁ \"..."           UDHR     czech
## 3 UDHR_danish.pdf  "\"Den 10. de\"..."           UDHR     danish
## 4 UDHR_english.pdf "\"Universal \"..."           UDHR     english
## 5 UDHR_french.pdf  "\"Déclaratio\"..."           UDHR     french
## 6 UDHR_greek.pdf   "\"ΟΙΚΟΥΜΕΝΙΚ\"..."           UDHR     greek
## # ... with 5 more rows
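To tie this back to the question: a data frame with doc_id and text columns, like the one readtext returns, can be passed to tm through DataframeSource, which in recent versions of tm should take the document IDs from the doc_id column. Below is a minimal sketch of that idea, not a tested solution for your files. The file names "2001.pdf" and "2002.pdf" and the two short texts are made-up stand-ins so that the example runs without any PDFs. In practice the text would come from readtext() or pdf_text().

```r
library(tm)

# Hypothetical file names; in practice: files <- list.files(pattern = "pdf$")
files <- c("2001.pdf", "2002.pdf")

# Stand-in for the extracted text, one string per document. With pdftools this
# could be: sapply(files, function(f) paste(pdf_text(f), collapse = "\n"))
texts <- c("growth and inflation outlook", "inflation and trade outlook")

# Strip the ".pdf" extension so that the document ID is just the year
years <- tools::file_path_sans_ext(basename(files))

# DataframeSource expects the first two columns to be named doc_id and text
docs <- data.frame(doc_id = years, text = texts, stringsAsFactors = FALSE)
corp <- VCorpus(DataframeSource(docs))
tdm <- TermDocumentMatrix(corp)

# The column labels of the TDM should now be the years, not 1, 2, ...
Docs(tdm)
```

If you go the readtext route, stripping the extension from rt_pdf$doc_id in the same way before building the corpus should give the same result.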
Upvotes: 1