sketman

Reputation: 47

Web scraping using R: read_html_live() - correct css selector to perform click

My base URL is https://obchody.heureka.sk/?f=1 and it contains a list of online shops whose base URLs I want to scrape. However, the URLs are not present on this page, because they are handled by JavaScript. I need to click each online shop's logo or name, which redirects me to the shop's home page, and I can get the URL from there.

I am having trouble performing the click; I did not succeed with rvest read_html_live() or with selenider elem_click().

My code for rvest read_html_live():

library(rvest)

url <- "https://obchody.heureka.sk/?f=1"
page <- read_html_live(url)
page$click(".c-shops-table__cell--name a", n_clicks = 1)

Code for selenider:

library(selenider)

selenider_session()
open_url(url)
s(".c-shops-table__cell--name a") |> elem_click(js = TRUE, timeout = NULL)

I expect to be redirected to the online shop's home page, but neither of the above works. What am I doing wrong, please?

Upvotes: 0

Views: 218

Answers (1)

margusl

Reputation: 17504

The URLs in this example are not mapped or translated by JavaScript; this is server-side HTTP redirection. You can therefore collect the link URLs with rvest, iterate over them (with map(), for example), and use httr/httr2 to make each request with redirect-following disabled, reading the actual target from the Location response header:

library(rvest)
library(httr2)
library(purrr)

# collect redirect location from response header
get_redirect <- function(url_, rate = 1){
  request(url_) |>
    # do not follow the redirect, we only want its target
    req_options(followlocation = FALSE) |>
    # limit to `rate` requests per second
    req_throttle(rate) |>
    req_perform() |>
    resp_header("Location")
}

urls <- 
  read_html("https://obchody.heureka.sk/?f=1") |>
  html_elements(".c-shops-table__cell--name a") |>
  html_attr("href")

head(urls)
#> [1] "https://www.heureka.sk/exit/alza-sk/?z=4"      
#> [2] "https://www.heureka.sk/exit/mall-sk/?z=4"      
#> [3] "https://www.heureka.sk/exit/datart-sk/?z=4"    
#> [4] "https://www.heureka.sk/exit/kaufland-sk/?z=4"  
#> [5] "https://www.heureka.sk/exit/andreashop-sk/?z=4"
#> [6] "https://www.heureka.sk/exit/mironetcz-sk/?z=4"

# get redirects for 1st 5 urls
urls[1:5] |>
  set_names() |>
  map_chr(get_redirect, .progress = TRUE) |>
  tibble::enframe(name = "url", value = "redirect")

#> # A tibble: 5 × 2
#>   url                                            redirect                       
#>   <chr>                                          <chr>                          
#> 1 https://www.heureka.sk/exit/alza-sk/?z=4       http://www.alza.sk?hgtid=7d6b6…
#> 2 https://www.heureka.sk/exit/mall-sk/?z=4       http://www.mall.sk?hgtid=da6c5…
#> 3 https://www.heureka.sk/exit/datart-sk/?z=4     https://www.datart.sk/?hgtid=f…
#> 4 https://www.heureka.sk/exit/kaufland-sk/?z=4   https://www.kaufland.sk?hgtid=…
#> 5 https://www.heureka.sk/exit/andreashop-sk/?z=4 http://www.andreashop.sk?hgtid…

Created on 2024-05-22 with reprex v2.1.0
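
The redirect targets above carry an hgtid tracking parameter. If only the shop's base URL is needed, one option is to drop the query string afterwards. A minimal sketch in base R, assuming everything after the first "?" can be discarded; strip_query() is a hypothetical helper name, and the hgtid values are placeholders because the real ones are truncated in the output above:

```r
# Hypothetical helper: remove everything from the first "?" to the end
strip_query <- function(x) sub("\\?.*$", "", x)

# Placeholder hgtid values, not the real (truncated) ones
redirects <- c(
  "http://www.alza.sk?hgtid=PLACEHOLDER",
  "https://www.datart.sk/?hgtid=PLACEHOLDER"
)

base_urls <- strip_query(redirects)
base_urls
```

This leaves the scheme, host and any trailing slash untouched, so further normalisation is a separate step if the results should look uniform.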

I'd guess your page$click() fails because your first request hits a cookie wall (do check page$view()) and/or because page$click() does not really work when the selector returns multiple matches.

Upvotes: 1
