Karstein

Reputation: 1

Naming of downloaded pdf-files does not work

When downloading many (48) pdf-files, the naming using str_match(myurl, "UniqueID=(.+)") fails. I can see that the downloads go fine, but the naming does not work, and when it is done I have only one file named "NA".

I am downloading a number of pdfs from a UN organisation database. This goes fine, as I can see all the files being downloaded. However, the file naming goes wrong, and in the end I have only one file called "NA".

library(downloader)
library(stringr)
for (myurl in pdfscollect) {
    filename<-paste("collected/", str_match(myurl, "UniqueID=(.+)")[2], ".pdf", sep="")
    download(myurl, filename)
    Sys.sleep(2)
}

I would expect all pdfs to be named uniquely, but no naming happens and I end up with only one file, named "NA".

pdfscollect contains all the links. Example: pdfstest<-c("http://www.ilo.org/evalinfo/product/download.do;?type=document&id=8287", "http://www.ilo.org/evalinfo/product/download.do;?type=document&id=10523",….)

Upvotes: 0

Views: 216

Answers (2)

Karstein

Reputation: 1

Thanks for the suggestion, @sindri_baldur. Actually, the result turns out the same, except that the name of the pdf file changes. I now realise that I cannot open the pdf file either. I think part of the problem is that the pdf link is a "..download.do..." link (ilo.org/evalinfo/product/download.do;?type=document&id=8287). I guess I should find another way to collect these pdfs.

Upvotes: 0

s_baldur

Reputation: 33603

If I understand correctly (?) the problem is that

paste("collected/", str_match(myurl, "UniqueID=(.+)")[2]

is returning NA when you are expecting the document ids:

[1] "8287"  "10523"

I suggest using instead something like the following (which does get the expected output):

str_extract(pdfstest, "(?<=id=)\\d+")

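Here the regular expression uses a lookbehind, (?<=id=), to match the run of one or more digits (\\d+) that immediately follows the first id= in each URL of your vector. The original pattern returns NA because the URLs contain no "UniqueID=" text at all.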
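To tie this back to the loop in the question, here is a minimal sketch of how the extracted ids could be turned into one file name per URL. It uses the two example URLs from the question. The helper download_all() and its pause argument are hypothetical names introduced only for illustration, and it swaps in base R's download.file() for downloader::download(). Using mode = "wb" is an assumption about what might help with binary PDF files on Windows. It is not a confirmed fix for the files that cannot be opened.

```r
library(stringr)

# Example URLs taken from the question
pdfstest <- c(
  "http://www.ilo.org/evalinfo/product/download.do;?type=document&id=8287",
  "http://www.ilo.org/evalinfo/product/download.do;?type=document&id=10523"
)

# The original pattern finds nothing, because "UniqueID=" is not in the URL
old_id <- str_match(pdfstest[1], "UniqueID=(.+)")[2]

# Extract the digits that follow "id=" (vectorised over all URLs)
ids <- str_extract(pdfstest, "(?<=id=)\\d+")

# Build one destination path per URL
filenames <- file.path("collected", paste0(ids, ".pdf"))

# Hypothetical helper (not from the original post): download each URL
# to its own file, pausing between requests
download_all <- function(urls, paths, pause = 2) {
  dir.create("collected", showWarnings = FALSE)
  for (i in seq_along(urls)) {
    # Assumption: binary mode keeps PDF content intact on Windows
    download.file(urls[i], paths[i], mode = "wb")
    Sys.sleep(pause)
  }
}

# Not run here, since it needs network access:
# download_all(pdfstest, filenames)
```

Because str_extract() is vectorised, all file names can be checked (for example with anyNA(ids)) before any download starts, which makes a silent "NA" file name easy to catch.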

Upvotes: 1
