Reputation: 1
When downloading many (48) PDF files, the naming using str_match(myurl, "UniqueID=(.+)") fails.
The downloads themselves go fine, but the naming does not work, and when it is done I have only one file named "NA".
I am downloading a number of PDFs from a UN organisation database. The downloads seem fine, as I can see every file being downloaded. However, the file naming goes wrong, and in the end I have only one file called "NA".
library(downloader)
library(stringr)
for (myurl in pdfscollect) {
  filename <- paste("collected/", str_match(myurl, "UniqueID=(.+)")[2], ".pdf", sep = "")
  download(myurl, filename)
  Sys.sleep(2)
}
I would expect every PDF to get a unique name, but the naming fails and I end up with a single file named "NA".
pdfscollect holds all the links. Example: pdfstest <- c("http://www.ilo.org/evalinfo/product/download.do;?type=document&id=8287", "http://www.ilo.org/evalinfo/product/download.do;?type=document&id=10523", …)
Upvotes: 0
Views: 216
Reputation: 1
Thanks for the suggestion, @sindri_baldur. Actually, the result turns out the same, except that the name of the PDF file changes. I now realise that I cannot open the PDF file either. I think part of the problem is that the PDF link is a "..download.do..." link (ilo.org/evalinfo/product/download.do;?type=document&id=8287). I guess I should find another way to collect these PDFs.
Upvotes: 0
Reputation: 33603
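If I understand correctly, the problem is that
str_match(myurl, "UniqueID=(.+)")[2]
returns NA for every URL, because none of the example URLs contain "UniqueID=" (they end in "id=" followed by a number). Each download is then written to the same NA-based file name and overwrites the previous one,
whereas you are expecting the document ids:
[1] "8287" "10523"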
I suggest using instead something like the following (which does get the expected output):
str_extract(pdfstest, "(?<=id=)\\d+")
Here we use a regular expression with a lookbehind to match the one or more digits that immediately follow the first id= in each of the URLs in your vector.
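To show how that extraction could be plugged back into the download loop, here is a minimal sketch. The helper name download_pdfs, its arguments, and the mode = "wb" setting are my own assumptions and are not part of the original code; the sketch only deals with the naming and does not address whether the download.do links return a valid PDF.

```r
library(stringr)

# Example URLs taken from the question
pdfstest <- c(
  "http://www.ilo.org/evalinfo/product/download.do;?type=document&id=8287",
  "http://www.ilo.org/evalinfo/product/download.do;?type=document&id=10523"
)

# Extract the digits that immediately follow "id=" in each URL
ids <- str_extract(pdfstest, "(?<=id=)\\d+")

# Build one destination path per URL, e.g. "collected/8287.pdf"
paths <- file.path("collected", paste0(ids, ".pdf"))

# Hypothetical helper (not from the original post): downloads each URL
# to a file named after its id, pausing between requests.
download_pdfs <- function(urls, dest_dir = "collected", pause = 2) {
  ids <- str_extract(urls, "(?<=id=)\\d+")
  # Stop early rather than writing several files to an "NA" name
  stopifnot(!anyNA(ids))
  dir.create(dest_dir, showWarnings = FALSE)
  for (i in seq_along(urls)) {
    dest <- file.path(dest_dir, paste0(ids[i], ".pdf"))
    # Assumption: mode = "wb" is passed through to download.file so that
    # binary files are written unchanged on Windows
    downloader::download(urls[i], dest, mode = "wb")
    Sys.sleep(pause)
  }
  invisible(file.path(dest_dir, paste0(ids, ".pdf")))
}

# Usage (requires network access and the downloader package):
# download_pdfs(pdfstest)
```

Checking for NA before downloading makes a non-matching pattern fail immediately, instead of silently overwriting one file on every iteration.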
Upvotes: 1