Ally

Reputation: 23

Web Scraping in R | Unable to extract information under a certain node using rvest

I'm trying to extract some information from the node /html/head/script[16] on a website (here), but I'm unable to do so.

library(rvest)

nykaa <- "https://www.nykaa.com/biotique-bio-kelp-protein-shampoo-for-falling-hair-intensive-hair-growth-treatment-conf/p/357142?categoryId=1292&productId=357142&ptype=product&skuId=39934"

obj <- read_html(nykaa)

extracted_json <- obj %>% 
  html_nodes(xpath = "/html/head/script[16]") %>% 
  html_text(trim = TRUE)

Currently, the code above returns an empty result. I would like to extract the data under that node in an organized form.

Upvotes: 2

Views: 68

Answers (1)

QHarr

Reputation: 84465

You can use a regex to grab the JavaScript object inside that script tag, then pass it to jsonlite to parse. You will need to root around in the result a bit to find what you want, but it is all there.

library(rvest)
library(magrittr)
library(stringr)
library(jsonlite)

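url <- 'https://www.nykaa.com/biotique-bio-kelp-protein-shampoo-for-falling-hair-intensive-hair-growth-treatment-conf/p/357142?categoryId=1292&productId=357142&ptype=product&skuId=39934'

# Read the page and flatten it to text; this includes the contents of the script tags.
p <- read_html(url) %>% html_text()

# Capture everything after the assignment to window.__PRELOADED_STATE__ (up to the end of that line).
json_text <- str_match_all(p, 'window\\.__PRELOADED_STATE__ = (.*)')[[1]][, 2]

# Parse the captured JavaScript object as JSON into a nested list.
all_data <- jsonlite::parse_json(json_text)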
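
To see the same idea without hitting the live site, here is a minimal offline sketch. The HTML string and the field names (product, id, name) are made up for illustration and are not the real page's structure. It also selects the script tags first rather than flattening the whole page, which is a small variation on the code above.

```r
library(rvest)
library(stringr)
library(jsonlite)

# Hypothetical stand-in for the product page: one script tag that assigns
# a JSON object to window.__PRELOADED_STATE__.
html <- '<html><head><script>window.__PRELOADED_STATE__ = {"product": {"id": 357142, "name": "Example shampoo"}}</script></head><body></body></html>'

# Text content of every script tag on the page.
scripts <- read_html(html) %>%
  html_elements("script") %>%
  html_text()

# One row per script; column 2 holds the captured object (NA where no match).
m <- str_match(scripts, 'window\\.__PRELOADED_STATE__ = (.*)')
json_text <- m[!is.na(m[, 2]), 2]

# Parse into a nested list and look at the top level to decide where to dig.
state <- parse_json(json_text)
str(state, max.level = 1)

state$product$name
```

On a real page, the captured text may be followed by a trailing semicolon or other code on the same line; if parse_json() complains, that trailing part would need trimming before parsing.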

Upvotes: 1
