Sebastian

Reputation: 11

Scraping a website with rvest when it asks for geographical location

I'm starting to scrape some websites from Argentina. I want to scrape these particular pages: "https://www.disco.com.ar/prod/88953/aderezo-mayonesa-natura-237-gr" and "https://www.disco.com.ar/prod/416680/cerveza-rubia-brahma-chopp-1-l-botella-retornable"

I already use the "rvest" package to collect prices and names from other websites. I'm trying to read the page with the following code:

    library(rvest)
    url_1 <- "https://www.disco.com.ar/prod/88953/aderezo-mayonesa-natura-237-gr"
    page <- read_html(url_1)

I want to scrape the entire page, including the price and the name of those particular products. My problem is that rvest only retrieves the first screen, the one shown before the user answers the location prompt that appears in Chrome. Once you click "allow" or "not allow", Chrome gives access to the full HTML. I have attached reference screenshots: I want to reach the product page, but I can only get the first screen with the logo.

How can I make the product information accessible via read_html? Do I have to use BeautifulSoup or something else?

Any help is more than welcome and I thank the entire community.

Upvotes: 0

Views: 98

Answers (1)

Bertrand Martel

Reputation: 45432

You need to make a call to:

POST /Geolocalizacion/Geolocalizacion.aspx/GuardarLocalizacion

save the cookies it returns, and pass them to your html_session. The product information is stored as JSON in the value attribute of an input tag named hfProductData:

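library(rvest)
library(httr)
library(jsonlite)

# Answer the location prompt on the site's behalf (noLocalizar = TRUE declines it).
# The body is already a JSON string, so httr sends it as-is.
r <- POST(
    "https://www.disco.com.ar/Geolocalizacion/Geolocalizacion.aspx/GuardarLocalizacion",
    content_type("application/json"),
    body = toJSON(
        list(
            latitud = NA,
            longitud = NA,
            noLocalizar = TRUE
        ),
        auto_unbox = TRUE
    )
)

# Turn the returned cookies into a named character vector (name = value)
cookieList <- cookies(r)
cookie_vec <- cookieList$value %>% setNames(cookieList$name)

url <- "https://www.disco.com.ar/prod/88953/aderezo-mayonesa-natura-237-gr"

# Request the product page with those cookies and read the hidden input.
# Note: in rvest >= 1.0.0, html_session() has been renamed session().
resp <- html_session(url, set_cookies(.cookies = cookie_vec)) %>%
    html_nodes('input[name="hfProductData"]') %>%
    html_attr("value")

print(fromJSON(resp))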

Output:

$DescripcionArticulo
[1] "Aderezo Mayonesa Natura 237 Gr"
$Grupo_Marca
[1] "NATURA"
$IdArchivoZoom
[1] ""
$IdArchivoBig
[1] "444812.jpg"
$IdArchivoSmall
[1] "444664.jpg"
$IdArticulo
[1] 88953
$Precio
[1] "49.52"
$unidadPedida
[1] "Un"
$Pesable
[1] "False"
$Stock
[1] "84.00"
$CucardaOferta
[1] ""
$Descuentos
list()
$ImgMxM
[1] "11510117005.jpg"
$Codigo
[1] "11510117005"
$Categoria
[1] "Almacén->Aderezos->Mayonesas"
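
Since the question asks only for the name and the price, those two fields can be picked out of the parsed JSON. Below is a minimal sketch, not part of the site's API: the helper name parse_product is hypothetical, and the inline sample_html is an abridged, hand-written stand-in for the real page (it carries only the two fields used, with values taken from the output above), so the example can be run offline.

```r
library(rvest)
library(jsonlite)

# Hypothetical helper: given a parsed page (or session), read the hidden
# hfProductData input and return the product name and price as a data frame.
parse_product <- function(page) {
    value <- page %>%
        html_nodes('input[name="hfProductData"]') %>%
        html_attr("value")
    if (length(value) == 0) {
        stop("hfProductData not found; the location cookies may be missing")
    }
    product <- fromJSON(value[[1]])
    data.frame(
        name = product$DescripcionArticulo,
        price = as.numeric(product$Precio),  # Precio arrives as a string
        stringsAsFactors = FALSE
    )
}

# Assumed sample markup mimicking the structure described above (abridged)
sample_html <- paste0(
    '<html><body><input type="hidden" name="hfProductData" value="',
    '{&quot;DescripcionArticulo&quot;:&quot;Aderezo Mayonesa Natura 237 Gr&quot;,',
    '&quot;Precio&quot;:&quot;49.52&quot;}',
    '"></body></html>'
)

result <- parse_product(read_html(sample_html))
print(result)
```

With a live connection, the same helper should accept the session returned by html_session() in the code above, and it can be applied to each product URL in turn, for example with lapply() followed by do.call(rbind, ...).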

Upvotes: 0
