Reputation: 47
Here is my base URL: https://obchody.heureka.sk/?f=1. It contains a list of online shops, and I want to scrape their base URLs. However, the URLs are not present on this page, because this is handled by JavaScript. I need to click on each online shop's logo or name, which redirects me to the shop's home page, and I can get the URL from there.
I am having trouble performing the click; I did not succeed with rvest's read_html_live() nor with selenider's elem_click().
My code for rvest read_html_live():
url <- "https://obchody.heureka.sk/?f=1"
page <- read_html_live(url)
page$click(".c-shops-table__cell--name a", n_clicks = 1)
Code for selenider:
selenider_session()
selenider::open_url(url)
s(".c-shops-table__cell--name a") %>% elem_click(js = TRUE, timeout = NULL)
My expectation is that I should get redirected to the online shop's home page, but none of the above works. What am I doing wrong?
Upvotes: 0
Views: 218
Reputation: 17504
URLs in this example are not mapped / translated with JavaScript; it's server-side HTTP redirection. So you can collect the URLs with rvest, iterate through the list of URLs (with map(), for example), and use httr / httr2 to make each request with redirection disabled and collect the actual target location from the response header:
library(rvest)
library(httr2)
library(purrr)
# collect redirect location from response header
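get_redirect <- function(url_, rate = 1){
  request(url_) |>
    # do not follow the redirect, we only want to read where it points
    req_options(followlocation = FALSE) |>
    # limit the request rate (requests per second)
    req_throttle(rate) |>
    req_perform() |>
    resp_header("Location")
}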
urls <-
  read_html("https://obchody.heureka.sk/?f=1") |>
  html_elements(".c-shops-table__cell--name a") |>
  html_attr("href")
head(urls)
#> [1] "https://www.heureka.sk/exit/alza-sk/?z=4"
#> [2] "https://www.heureka.sk/exit/mall-sk/?z=4"
#> [3] "https://www.heureka.sk/exit/datart-sk/?z=4"
#> [4] "https://www.heureka.sk/exit/kaufland-sk/?z=4"
#> [5] "https://www.heureka.sk/exit/andreashop-sk/?z=4"
#> [6] "https://www.heureka.sk/exit/mironetcz-sk/?z=4"
# get redirects for 1st 5 urls
urls[1:5] |>
  set_names() |>
  map_chr(get_redirect, .progress = TRUE) |>
  tibble::enframe(name = "url", value = "redirect")
#> # A tibble: 5 × 2
#> url redirect
#> <chr> <chr>
#> 1 https://www.heureka.sk/exit/alza-sk/?z=4 http://www.alza.sk?hgtid=7d6b6…
#> 2 https://www.heureka.sk/exit/mall-sk/?z=4 http://www.mall.sk?hgtid=da6c5…
#> 3 https://www.heureka.sk/exit/datart-sk/?z=4 https://www.datart.sk/?hgtid=f…
#> 4 https://www.heureka.sk/exit/kaufland-sk/?z=4 https://www.kaufland.sk?hgtid=…
#> 5 https://www.heureka.sk/exit/andreashop-sk/?z=4 http://www.andreashop.sk?hgtid…
Created on 2024-05-22 with reprex v2.1.0
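One thing the reprex above does not cover: if some exit link answers without a Location header, resp_header() returns NULL and map_chr() stops with an error. A minimal sketch of a more forgiving variant follows. The names location_or_na() and get_redirect_safe() are hypothetical helpers, not part of the tested code above, and I am assuming you would rather get NA for such shops than have the whole run abort:

```r
library(httr2)

# Hypothetical helper: extract the Location header, or NA if it is absent
location_or_na <- function(resp) {
  loc <- resp_header(resp, "Location")
  if (is.null(loc)) NA_character_ else loc
}

# Hypothetical variant of get_redirect() that never errors on HTTP status
get_redirect_safe <- function(url_, rate = 1) {
  request(url_) |>
    req_options(followlocation = FALSE) |>
    req_throttle(rate) |>
    # assumption: treat every HTTP status as non-fatal, so 4xx/5xx yield NA
    req_error(is_error = \(resp) FALSE) |>
    req_perform() |>
    location_or_na()
}
```

It should drop into the same map_chr() call in place of get_redirect. Connection-level failures (timeouts, DNS errors) would still raise an error.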
I'd guess your page$click() fails because your 1st request hits a cookie wall (do check page$view()) and/or page$click() does not really work if the selector returns multiple matches.
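Since the question asks for the shops' base URLs and the redirect targets carry an hgtid tracking parameter, you may also want to trim them down to scheme and host. A rough base R sketch, where base_url() is a hypothetical helper name and the hgtid value in the usage lines is a made-up placeholder:

```r
# Hypothetical helper: keep only "scheme://host", dropping any path,
# query string (e.g. ?hgtid=...) and fragment
base_url <- function(u) {
  sub("^(https?://[^/?#]+).*$", "\\1", u)
}

# "abc123" is a placeholder, not a real tracking id
base_url("http://www.alza.sk?hgtid=abc123")
base_url("https://www.datart.sk/?hgtid=abc123")
```

The function is vectorised, so it can be applied directly to the redirect column of the tibble above. Strings that do not start with http:// or https:// are returned unchanged.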
Upvotes: 1