Reputation: 65
I am trying to scrape https://www.filmweb.no/kinotoppen/ for the title and other information listed under each movie. On other pages I have managed with a couple of lines of html_nodes() and html_text(), using SelectorGadget to pick the CSS selectors for the elements I wanted, like this:
library(rvest)
html <- read_html("https://www.filmweb.no/kinotoppen/")
title <- html %>%
  html_nodes(".Kinotoppen_MovieTitle__2MFbT") %>%
  html_text()
However, when I run those lines on this page I get only an empty character vector. Inspecting the page further, I see that it loads its content with JavaScript. I tried using html_nodes("script") together with the V8 package to run the scripts, but to no avail. I am also unsure which scripts to run, so I tried each of them in turn, like this:
library(V8)
scripts <- html %>% html_nodes("script") %>% html_text()
ct <- v8()
ct$eval(scripts[3])
Is there an easier way in general to get the webpage into a form where I can just use rvest? I do not know anything about JavaScript.
Upvotes: 1
Views: 2726
Reputation: 84455
The data is retrieved dynamically through a GraphQL query. You can replicate that query to get a JSON response containing all the desired data.
In this case I chose to use httr2 and the newish native pipe operator (R 4.1.0).
For how to pipe the headers vector I looked at the solution given by @MrFlick here.
library(httr2)
# Request headers as a named character vector
headers = c(
  'Accept' = 'application/json',
  'Referer' = 'https://www.filmweb.no/',
  'Content-Type' = 'application/json',
  'User-Agent' = 'Mozilla/5.0'
)
# GraphQL query plus its variables (a JSON string); adjust the date as needed
params = list(
  'query' = 'query($date:String,$chartType:String,$max:Int){movieQuery{getMovieChart(date:$date,chartType:$chartType,max:$max){chartType periodStart periodEnd movieChartItem{pos posPrev admissions admissionsPrev admissionsToDate weeksOnList movie{title mainVersionId premiere poster{name versions{width height url}}}}}}}',
  'variables' = '{"date":"2022-02-04","chartType":"weekend","max":1000}'
)
# Send the query as URL parameters and parse the JSON body into a nested list
data <- request("https://skynet.filmweb.no/MovieInfoQs/graphql/") |>
  (\(x) req_headers(x, !!!headers))() |>
  req_url_query(!!!params) |>
  req_perform() |>
  resp_body_json()
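To get from the nested list to something tabular, a minimal sketch could look like the one below. It assumes the parsed response mirrors the query, nesting as data -> movieQuery -> getMovieChart -> movieChartItem, which is the usual GraphQL convention but is not confirmed here; check with str(data, max.level = 4) first. The helper name chart_to_df and the mock list are made up for illustration, and the sketch assumes every item carries both pos and movie$title.

```r
# Hypothetical helper: flatten the chart items of a parsed GraphQL response.
# Assumes resp$data$movieQuery$getMovieChart$movieChartItem is a list of
# items, each with a `pos` field and a nested `movie$title` field.
chart_to_df <- function(resp) {
  items <- resp$data$movieQuery$getMovieChart$movieChartItem
  data.frame(
    pos = vapply(items, function(x) as.integer(x$pos), integer(1)),
    title = vapply(items, function(x) x$movie$title, character(1)),
    stringsAsFactors = FALSE
  )
}

# Mock list standing in for the parsed response (placeholder values only)
mock <- list(data = list(movieQuery = list(getMovieChart = list(
  chartType = "weekend",
  movieChartItem = list(
    list(pos = 1, movie = list(title = "Film A")),
    list(pos = 2, movie = list(title = "Film B"))
  )
))))

chart <- chart_to_df(mock)
print(chart)
```

With the real response, chart_to_df(data) would be the call to try; the other fields in the query (admissions, weeksOnList and so on) could be added as further columns in the same way.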
Upvotes: 2
Reputation: 21757
Here's what it would look like using RSelenium to get the page to load.
library(rvest)
library(RSelenium)
# Start a Selenium server and a Chrome client on port 4444
remDr <- rsDriver(browser = 'chrome', port = 4444L)
brow <- remDr[["client"]]
brow$open()
brow$navigate("https://www.filmweb.no/kinotoppen/")
# Grab the rendered page source and hand it to rvest
h <- brow$getPageSource()
h <- read_html(h[[1]])
h %>% html_nodes(".Kinotoppen_MovieTitle__2MFbT") %>%
  html_text()
# [1] "Spider-Man: No Way Home" "Clifford: Den store røde hunden" "Lise & Snøpels - Venner for alltid"
# [4] "Familien Voff - alle trenger en venn" "Nightmare Alley" "Snødronningen"
# [7] "Scream" "Bergman Island" "Trøffeljegerne fra Piemonte"
# [10] "Encanto"
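One caveat on the selector: the __2MFbT suffix looks like a build-generated hash, so it may change when the site is redeployed (this is an assumption, not something verified against the site). A sketch of a more tolerant selector, matching on the stable part of the class name, is shown below on a made-up HTML snippet standing in for the rendered page source.

```r
library(rvest)

# Stand-in for the rendered page source; the markup is invented for illustration
page <- read_html('
  <div>
    <h2 class="Kinotoppen_MovieTitle__2MFbT">Film A</h2>
    <h2 class="Kinotoppen_MovieTitle__9zYxW">Film B</h2>
    <p class="Kinotoppen_Other__abc12">not a title</p>
  </div>')

# Attribute substring selector: any element whose class contains the prefix
titles <- page %>%
  html_nodes("[class*='Kinotoppen_MovieTitle']") %>%
  html_text()

print(titles)
```

The same selector can be dropped into the html_nodes() call above in place of the full class name.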
Upvotes: 4