S Front

Reputation: 353

Scraping PDFs of all linked websites

I would like to scrape official laws from websites (here is an example). The documents are accessible through a menu on the HTML page. I managed to extract links from websites such as GitHub and download the PDFs; however, I have difficulty extracting them from this type of website. I tried the following code:

library(rvest)

# read html 
page <- read_html("https://bl.clex.ch/app/de/texts_of_law/780")

# from nodes I would like to get the links where the PDFs are stored
raw_list <- page %>%   # takes the page above for which we've read the html
  html_nodes("a") %>%  # find all links in the page
  html_attr("href")

No links are found on this website; the result is an empty character vector:

character(0)

My questions:

  1. What is different about the menu on the linked website compared to, for example, PDFs stored on GitHub that are accessible through links on the main page of a GitHub project?
  2. How can I access the links and download all PDFs stored in this menu?

Upvotes: 0

Views: 121

Answers (1)

Abdessabour Mtk

Reputation: 3888

The website you're trying to scrape appears to be an Angular-based application, i.e. it loads its content through XHR requests after the initial page is served. That is why the static HTML returned by read_html() contains no links. You can see these requests in Chrome's developer tools, in the Network tab with the XHR filter enabled.

There you'll find that the website calls https://bl.clex.ch/api/de/texts_of_law/780 (basically the same URL with app changed to api). This request returns a JSON string.

I tried parsing it with jsonlite, but that gave an error, so I used a regular expression to match all the entries that have pdf_link in their key.

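library(RCurl)

uri <- "https://bl.clex.ch/app/de/texts_of_law/780"
# swap /app/ for /api/ to get the JSON endpoint, then fetch the raw response
json <- getURL(sub("/app/", "/api/", uri, fixed = TRUE))
# capture each pdf_link* key (column 2) and its URL (column 3)
stringr::str_match_all(json, '"(pdf_link[a-z_]*?)":"(.+?)",')[[1]][, 2:3]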

Output

     [,1]                        [,2]                                                                                          
[1,] "pdf_link"                  "http://bl.clex.ch/frontend/versions/pdf_file_with_annex/1337?locale=de"                      
[2,] "pdf_link_tol"              "http://bl.clex.ch/frontend/versions/1337/download_pdf_file?locale=de"                        
[3,] "pdf_link_annexes"          "http://bl.clex.ch/frontend/structured_documents/3473/download_pdf_annex?locale=de"           
[4,] "pdf_link_tol_with_annexes" "http://bl.clex.ch/frontend/structured_documents/3473/download_pdf_file_and_annex?locale=de"  
[5,] "pdf_link"                  "http://bl.clex.ch/frontend/change_document_file_dictionaries/194/download_pdf_file?locale=de"
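
To address the second part of the question (downloading the PDFs), the links in the second column can be passed to download.file(). The following is only a minimal sketch: pdf_file_name() and download_pdfs() are hypothetical helpers, and the file-naming scheme (last two path segments of the URL) and the "pdfs" destination folder are assumptions made for illustration, not something the website prescribes.

```r
# Hypothetical helper: derive a local file name from a PDF link.
# Assumption: the last two path segments are enough to tell the files apart.
pdf_file_name <- function(url) {
  path  <- sub("\\?.*$", "", url)                  # drop the query string (?locale=de)
  parts <- strsplit(path, "/", fixed = TRUE)[[1]]  # split the URL into path segments
  paste0(paste(tail(parts, 2), collapse = "_"), ".pdf")
}

# Hypothetical helper: download every link into dest_dir.
download_pdfs <- function(urls, dest_dir = "pdfs") {
  urls  <- unique(urls)
  dests <- file.path(dest_dir,
                     vapply(urls, pdf_file_name, character(1), USE.NAMES = FALSE))
  dir.create(dest_dir, showWarnings = FALSE)
  for (i in seq_along(urls)) {
    # mode = "wb" writes the file in binary mode so the PDF is not corrupted
    tryCatch(
      download.file(urls[i], dests[i], mode = "wb", quiet = TRUE),
      error = function(e) message("Failed to download: ", urls[i])
    )
  }
  invisible(dests)
}
```

Assuming the matrix from the regular expression above is stored in a variable, say links, calling download_pdfs(links[, 2]) would attempt to fetch each file. Whether the server requires cookies or redirects these URLs is not something I have checked, so some requests may fail and need further handling.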

Upvotes: 1
