SChatcha
SChatcha

Reputation: 147

Web scraping pdf files from HTML

How can I scrape the pdf documents from HTML?
I am using R and I can do only extract the text from HTML. The example of the website that I am going to scrape is as follows.

https://www.bot.or.th/English/MonetaryPolicy/Northern/EconomicReport/Pages/Releass_Economic_north.aspx

Upvotes: 2

Views: 10954

Answers (1)

KenHBS
KenHBS

Reputation: 7174

When you say you want to scrape the PDF files from HTML pages, I think the first problem you face is to actually identify the location of those PDF files.

library(XML)
library(RCurl)

url <- "https://www.bot.or.th/English/MonetaryPolicy/Northern/EconomicReport/Pages/Releass_Economic_north.aspx"
page   <- getURL(url)
parsed <- htmlParse(page)
links  <- xpathSApply(parsed, path="//a", xmlGetAttr, "href")
inds   <- grep("*.pdf", links)
links  <- links[inds]

links contains all the URLs to the PDF-files you are trying to download.

Beware: many websites don't like it very much when you automatically scrape their documents and you get blocked.

With the links in place, you can start looping through the links and download them one by one and saving them in your working directory under the name destination. I decided to extract reasonable document names for your PDFs, based on the links (extracting the final piece after the last / in the urls

regex_match <- regexpr("[^/]+$", links, perl=TRUE)
destination <- regmatches(links, regex_match)

To avoid overloading the servers of the website, I have heard it is friendly to pause your scraping every once in a while, so therefore I use 'Sys.sleep()` to pause scraping for a time between 0 and 5 seconds:

for(i in seq_along(links)){
  download.file(links[i], destfile=destination[i])
  Sys.sleep(runif(1, 1, 5))
}

Upvotes: 4

Related Questions