Reputation: 39
Being relatively new to R web scraping, I am hoping for some help with a web-scraping project issue. I want to scrape the data that generates the chart on this page.
I have inspected the page in Chrome and identified the link that returns the data.
Using this URL, I have created the following code to parse the data:
library(xml2)

url <- 'https://www.solactive.com/Indices/?indexhistory=DE000SL0BBT0&indexhistorytype=max'
index_data <- read_xml(url)
Unfortunately, I am receiving the error message:
Error in read_xml.raw(raw, encoding = encoding, base_url = base_url, as_html = as_html, :
Failed to parse text
I have also inspected the request in the browser, which shows the following:
Response Headers
content-encoding: gzip
content-length: 20624
content-type: text/html; charset=UTF-8
date: Thu, 21 Apr 2022 00:33:05 GMT
server: nginx
strict-transport-security: max-age=63072000
vary: Accept-Encoding
Accept Headers (snapshot)
accept: application/json, text/javascript, */*; q=0.01
accept-encoding: gzip, deflate, br
accept-language: en-US,en;q=0.9
I have also tried applying the following encoding, with no success:
index_data <- read_xml(url, encoding = "gzip, deflate, br")
What I am after is a data table with the columns index_id, date, and value.
Any assistance would be appreciated.
Thank you
Upvotes: 1
Views: 110
Reputation: 84465
Not sure why in R, despite setting various headers, the response remains HTML, whereas with Python it is sufficient to pass only the referer header and get JSON back. It's a bit of a faff, but you can extract the JSON from a p tag in the response and parse it with jsonlite:
library(httr2)
library(rvest)

# Referer header and query parameters taken from the request seen in the browser
headers <- c('referer' = 'https://www.solactive.com/Indices/?index=DE000SL0BBT0')
params <- list('indexhistory' = 'DE000SL0BBT0', 'indexhistorytype' = 'max')

data <- request("https://www.solactive.com/Indices/") |>
  (\(x) req_headers(x, !!!headers))() |>   # splice the referer header into the request
  req_url_query(!!!params) |>              # add the index/history query parameters
  req_perform() |>
  resp_body_html() |>                      # the response comes back as HTML ...
  html_element('p') |>                     # ... with the JSON payload inside a p tag
  html_text() |>
  jsonlite::parse_json(simplifyVector = TRUE)
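From there you can reshape the parsed result into the index_id / date / value table you asked for, along these lines. This is only a rough sketch: the element names used below (indexId, and a data element with columns d and v) are assumptions about the JSON layout, so check str(data) and adjust accordingly.

# Sketch only - indexId, data$d and data$v are assumed names, verify with str(data)
index_history <- data.frame(
  index_id = data$indexId,
  date     = as.Date(data$data$d),
  value    = data$data$v
)
head(index_history)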
Upvotes: 2