IanLux

Reputation: 13

Scraping PDF files from the web

This question was answered here (Web scraping pdf files from HTML), but the solution doesn't work for me on either my target URL or the OP's target URL. Since I'm not supposed to ask this as an answer to the earlier post, I'm starting a new question.

My code is exactly as per the OP's, and the error message I receive is: "Error in download.file(links[i], destfile = save_names[i]) : invalid 'url' argument"

The code I'm using is:

install.packages("RCurl")
install.packages("XML")
library(XML)
library(RCurl)
url <- "https://www.bot.or.th/English/MonetaryPolicy/Northern/EconomicReport/Pages/Releass_Economic_north.aspx"
page   <- getURL(url)
parsed <- htmlParse(page)
links  <- xpathSApply(parsed, path="//a", xmlGetAttr, "href")
inds   <- grep("\\.pdf$", links)   # "*.pdf" is a glob, not a regex; match a literal ".pdf" at the end
links  <- links[inds]


regex_match <- regexpr("[^/]+$", links)
save_names <- regmatches(links, regex_match)

for(i in seq_along(links)){
  download.file(links[i], destfile=save_names[i])
  Sys.sleep(runif(1, 1, 5))
}

Any help much appreciated! Thanks

Upvotes: 0

Views: 774

Answers (1)

IanLux

Reputation: 13

Solved! I don't know why this works, but I swapped the for loop for the following code and the downloads now succeed:

Map(function(u, d) download.file(u, d, mode = "wb"), links, save_names)
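For what it's worth, Map() here just applies download.file elementwise over the two vectors, much like the original loop; the substantive change is most likely mode = "wb", which per ?download.file forces a binary write (on Windows the default text mode can corrupt binary files such as PDFs). A sketch of the equivalent explicit loop, reusing the links and save_names from the question:

```r
# Equivalent to the Map() call above: same elementwise pairing of
# URL and destination file, but written as a loop. The key difference
# from the question's original loop is mode = "wb", which writes the
# file in binary mode (important for PDFs, especially on Windows).
for (i in seq_along(links)) {
  download.file(links[i], destfile = save_names[i], mode = "wb")
  Sys.sleep(runif(1, 1, 5))  # polite random delay between requests
}
```

One practical difference: Map() returns a list of download.file's return codes (0 on success), which can be handy for checking which downloads failed.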

Upvotes: 0
