Scrape site that asks for cookies consent with rvest

Question

I'd like to scrape (using rvest) a website that asks users to consent to set cookies. If I just scrape the page, rvest only downloads the popup. Here is the code:

library(rvest)
content <- read_html("https://karriere.nrw/stellenausschreibung/dba41541-8ed9-4449-8f79-da3cda0cc07c") 
content %>% html_text()

The result seems to be the content of the popup window asking for consent.

Is there a way to ignore or accept the popup or to set a cookie in advance so I can access the main text of the site?

Datapumpernickel · Accepted Answer

As suggested, the website is dynamic, which means its content is built by JavaScript in the browser, so rvest only sees the initial HTML (here, the consent popup). Reconstructing from the .js file how the page is assembled is usually very time-consuming (or outright impossible), but in this case the network tab of your browser's developer tools shows a non-hidden API that serves the information you want: the request to api.karriere.nrw.

Hence you can take the uuid (the posting's identifier in the database) from your URL and make a simple GET request to the API. That goes straight to the source without rendering the page through RSelenium, which costs extra time and resources.

Be friendly though, and include some way for them to contact you (for example an email address in the request headers), so they can tell you to stop.

library(tidyverse)  # stringr helpers and the %>% pipe
library(httr)
library(jsonlite)

### contact details sent along with every request
headers <- c("Email" = "johndoe@company.com")

### assuming the url is given and always has the same format
url <- "https://karriere.nrw/stellenausschreibung/dba41541-8ed9-4449-8f79-da3cda0cc07c"

### extract identifier of job posting (fifth element after splitting on "/")
uuid <- str_split(url, "/")[[1]][5]

### build the api address
api_url <- str_c("https://api.karriere.nrw/v1.0/stellenausschreibungen/", uuid)

### request the posting and parse the JSON body
response <- httr::GET(api_url,
                      httr::add_headers(.headers = headers))
result <- httr::content(response, as = "text", encoding = "UTF-8") %>%
  jsonlite::fromJSON()
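
If you need this for more than one posting, the steps above could be wrapped up roughly as follows. This is only a sketch: extract_uuid() and fetch_posting() are hypothetical helper names, not part of any package, and the UUID pattern is an assumption based on the format of the example URL. It matches the identifier with a regex rather than by its position in the path, and stops with an error on a failed request instead of trying to parse an error page.

```r
# Sketch only: extract_uuid() and fetch_posting() are hypothetical helpers.

# Pull the UUID out of a single posting URL with a regex instead of
# relying on its position in the path. Returns NA if none is found.
extract_uuid <- function(url) {
  pattern <- "[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}"
  hit <- regmatches(url, regexpr(pattern, url))
  if (length(hit) == 0) NA_character_ else hit
}

# Build the API address, fail loudly on HTTP errors, then parse the JSON.
fetch_posting <- function(url, email) {
  uuid <- extract_uuid(url)
  if (is.na(uuid)) stop("No UUID found in URL: ", url)

  api_url <- paste0("https://api.karriere.nrw/v1.0/stellenausschreibungen/", uuid)
  response <- httr::GET(api_url, httr::add_headers(Email = email))
  httr::stop_for_status(response)  # raises an R error on 4xx/5xx responses

  jsonlite::fromJSON(httr::content(response, as = "text", encoding = "UTF-8"))
}
```

Calling fetch_posting(url, "johndoe@company.com") should then give the same parsed result as the snippet above, assuming the API keeps responding in the same way. When looping over many URLs, adding a pause such as Sys.sleep(1) between calls keeps the load on their server low.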
