Reputation: 7846
I am trying to import a pdf that is on the web into R:
library(tm)
webpdf <- "https://www.lme.com/~/media/Files/Market%20data/COTR/2015/2015_01/Cotr%2019%20Jan%202015.pdf"
uri <- sprintf("file://%s", system.file(file.path("doc", webpdf), package = "tm"))
if(all(file.exists(Sys.which(c("pdfinfo", "pdftotext"))))) {
  pdf <- readPDF(control = list(text = "-layout"))(elem = list(uri = uri),
                                                   language = "en",
                                                   id = "id1")
  content(pdf)[1:13]
}
VCorpus(URISource(uri, mode = ""),
        readerControl = list(reader = readPDF(engine = "ghostscript")))
This does not work; I get the following error message:
Error in system2(gs_cmd, c("-dNODISPLAY -q", sprintf("-sFile=%s", shQuote(file)), :
'""' not found
Upvotes: 0
Views: 187
Reputation: 78792
Lots of problems with the initial setup. This will get you the PDF content, but you should ask another question for the tm Corpus issues you're going to have.
library(tm)
library(httr) # this will make it easier to get to https content
webpdf <- "https://www.lme.com/~/media/Files/Market%20data/COTR/2015/2015_01/Cotr%2019%20Jan%202015.pdf"
doc <- "cotr.pdf"
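# save the file locally; skip the download if it is already on disk,
# since write_disk() refuses to overwrite an existing file by default
if (!file.exists(doc)) stop_for_status(GET(webpdf, write_disk(doc)))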
if(all(file.exists(Sys.which(c("pdfinfo", "pdftotext"))))) {
  pdf <- readPDF(control = list(text = "-layout"))(elem = list(uri = doc),
                                                   language = "en",
                                                   id = "id1")
  # httr also has a "content()" so make the call explicit
  NLP::content(pdf)[1:13]
}
print(str(pdf))
## List of 2
## $ content: chr [1:113] "Commitment of Trader Report - Market Report as of 2015/01/21" "" "Metal" "AA" ...
## $ meta :List of 7
## ..$ author : NULL
## ..$ datetimestamp: POSIXlt[1:1], format: "2015-01-21 08:59:10"
## ..$ description : NULL
## ..$ heading : NULL
## ..$ id : chr "cotr.pdf"
## ..$ language : chr "en"
## ..$ origin : NULL
## ..- attr(*, "class")= chr "TextDocumentMeta"
## - attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
## NULL
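One of the setup problems can be shown in isolation. This is a minimal sketch, not a diagnosis of the quoted error (which, judging by its wording, may be about the Ghostscript command itself; that part is a guess). system.file() only looks for files shipped inside an installed package and returns an empty string when nothing matches, so the web URL never reaches the reader. The utils package stands in for tm here so the snippet runs without extra installs, and bad_path and bad_uri are illustrative names.

```r
webpdf <- "https://www.lme.com/~/media/Files/Market%20data/COTR/2015/2015_01/Cotr%2019%20Jan%202015.pdf"

# system.file() searches inside an installed package; a web URL is never there,
# so the lookup comes back empty (utils is only a stand-in for tm, an assumption
# made to keep this runnable everywhere)
bad_path <- system.file(file.path("doc", webpdf), package = "utils")
bad_uri <- sprintf("file://%s", bad_path)

print(nzchar(bad_path))  # FALSE means no file was found
print(bad_uri)           # only the scheme is left, with no path behind it
```

Downloading the file first and handing the local path to the reader, as in the code above, sidesteps this lookup entirely.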
Upvotes: 1