Reputation: 353
I would like to scrape official laws from websites (here is an example). The documents are accessible through a menu on the HTML page. I have managed to extract links from websites such as GitHub and download PDFs; however, I have difficulty extracting them from this type of website. I tried the following code:
library(rvest)

# read the HTML of the page
page <- read_html("https://bl.clex.ch/app/de/texts_of_law/780")

# from the <a> nodes, extract the links where the PDFs are stored
raw_list <- page %>%
  html_nodes("a") %>%  # find all links on the page
  html_attr("href")    # keep only their href attributes
No links are found on this website; the result is an empty character vector:
character(0)
My question: how can I extract the PDF links from this kind of website?
Upvotes: 0
Views: 121
Reputation: 3888
Apparently the website you're trying to scrape is an Angular-based website, i.e. it uses XHR requests to load content. So take a look at the developer tools in Chrome, specifically the Network tab filtered to XHR requests.
You'll find that the website calls https://bl.clex.ch/api/de/texts_of_law/780 (basically app changed to api); this request returns a JSON string. I tried parsing it with jsonlite, but it gave an error, so I used a regular expression to match all the entries that contain pdf_link.
library(RCurl)

uri <- "https://bl.clex.ch/app/de/texts_of_law/780"
# swap /app/ for /api/ to hit the JSON endpoint directly
json <- getURL(sub('/app/', '/api/', uri, fixed = TRUE))
# capture every "pdf_link...":"..." key/value pair; keep the name and URL columns
stringr::str_match_all(json, '"(pdf_link[a-z_]*?)":"(.+?)",')[[1]][, 2:3]
[,1] [,2]
[1,] "pdf_link" "http://bl.clex.ch/frontend/versions/pdf_file_with_annex/1337?locale=de"
[2,] "pdf_link_tol" "http://bl.clex.ch/frontend/versions/1337/download_pdf_file?locale=de"
[3,] "pdf_link_annexes" "http://bl.clex.ch/frontend/structured_documents/3473/download_pdf_annex?locale=de"
[4,] "pdf_link_tol_with_annexes" "http://bl.clex.ch/frontend/structured_documents/3473/download_pdf_file_and_annex?locale=de"
[5,] "pdf_link" "http://bl.clex.ch/frontend/change_document_file_dictionaries/194/download_pdf_file?locale=de"
Upvotes: 1