adam.888

Reputation: 7846

Importing a web pdf into R

I am trying to import a PDF that is hosted on the web into R:

library(tm)

webpdf <- "https://www.lme.com/~/media/Files/Market%20data/COTR/2015/2015_01/Cotr%2019%20Jan%202015.pdf"
uri <- sprintf("file://%s", system.file(file.path("doc", webpdf), package = "tm"))
if(all(file.exists(Sys.which(c("pdfinfo", "pdftotext"))))) {
pdf <- readPDF(control = list(text = "-layout"))(elem = list(uri = uri),
                                                 language = "en",
                                                 id = "id1")
content(pdf)[1:13]
}
VCorpus(URISource(uri, mode = ""),
    readerControl = list(reader = readPDF(engine = "ghostscript")))

This does not work, and I get the following error message:

Error in system2(gs_cmd, c("-dNODISPLAY -q", sprintf("-sFile=%s", shQuote(file)),  : 
  '""' not found

Upvotes: 0

Views: 187

Answers (1)

hrbrmstr

Reputation: 78792

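There are several problems with the initial setup. system.file() only looks up files inside an installed package, so for a web URL it returns "" and uri ends up as just "file://"; the PDF has to be downloaded to a local file first. The error message itself suggests that R could not find a Ghostscript executable, which the "ghostscript" engine needs. The code below will get you the PDF content, but you should ask another question for the tm Corpus issues you're going to have.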

library(tm)
library(httr) # this will make it easier to get to https content

webpdf <- "https://www.lme.com/~/media/Files/Market%20data/COTR/2015/2015_01/Cotr%2019%20Jan%202015.pdf"

doc <- "cotr.pdf"

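# save the file locally; skip the download if it is already on disk,
# because write_disk() will not overwrite an existing file by default
if (!file.exists(doc)) stop_for_status(GET(webpdf, write_disk(doc)))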

if(all(file.exists(Sys.which(c("pdfinfo", "pdftotext"))))) {

  pdf <- readPDF(control = list(text = "-layout"))(elem = list(uri = doc),
                                                   language = "en",
                                                   id = "id1")
  # httr also has a "content()" so make the call explicit
  NLP::content(pdf)[1:13]

}

# str() prints the structure itself, so no print() wrapper is needed
str(pdf)

## List of 2
##  $ content: chr [1:113] "Commitment of Trader Report - Market Report as of 2015/01/21" "" "Metal" "AA" ...
##  $ meta   :List of 7
##   ..$ author       : NULL
##   ..$ datetimestamp: POSIXlt[1:1], format: "2015-01-21 08:59:10"
##   ..$ description  : NULL
##   ..$ heading      : NULL
##   ..$ id           : chr "cotr.pdf"
##   ..$ language     : chr "en"
##   ..$ origin       : NULL
##   ..- attr(*, "class")= chr "TextDocumentMeta"
##  - attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
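
If you want to see for yourself why the original setup could not work, here is a small diagnostic sketch. It downloads nothing and uses only base R plus the tools package that ships with R. The variable name local_path is mine, not from the question. What the last two calls print depends on what is installed on your machine, so I have not shown their output.

```r
# URL from the question; nothing is downloaded here.
webpdf <- "https://www.lme.com/~/media/Files/Market%20data/COTR/2015/2015_01/Cotr%2019%20Jan%202015.pdf"

# system.file() only searches inside an installed package's directory,
# so a web URL can never match and the result is an empty string.
local_path <- system.file(file.path("doc", webpdf), package = "tm")
uri <- sprintf("file://%s", local_path)

nzchar(local_path)  # FALSE: no such file inside the tm package
uri                 # "file://", i.e. a URI that points at nothing

# Which external tools can R see? An empty string means "not found".
Sys.which(c("pdfinfo", "pdftotext"))  # needed by the pdftotext-based reader
tools::find_gs_cmd()                  # needed by engine = "ghostscript"
```

If tools::find_gs_cmd() comes back empty, that would be consistent with the '""' not found error in the question: the reader would have no Ghostscript command to run.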

Upvotes: 1
