rvest web scraping with javascript

Question

I am trying to scrape the daily forecast from FiveThirtyEight using rvest, but the object I'm interested in seems to be a JavaScript object, and I am having difficulty even working out where and what to look for. (I'm not well versed in CSS or JavaScript, though I have tried to educate myself over the last couple of days.)

By inspecting the webpage element and CSS selector, I have figured out the following:

The location to look is ..., so I tried

library(rvest)
url <- 
  "https://projects.fivethirtyeight.com/election-2016/national-primary-polls/democratic/"

url %>% 
  read_html() %>% 
  html_nodes("#polling-avg-chart")

without much success. The output is simply

{xml_nodeset (1)}
[1] <div id="polling-avg-chart">

The individual poll results, shown as dots, are in ..., where you can see 502 numeric locations. I'm guessing that I will have to translate the cx and cy of each node into the appropriate percentages, which is done by ... and so on.
However, I do not see the underlying data for the forecast line (as opposed to the dots).
When I let my cursor hover over the chart, I see things such as change, and values such as change, and I'm guessing that these values are what's creating the daily forecast line.
But where these values are stored, and how to translate them back into things like "49.1% Clinton vs. 26.6% Sanders", is still a mystery to me.

I did read a few other SO posts such as this but none of them seemed applicable to this particular problem. What would be the best way to get the forecast percentages in a neat dataframe?

Aurèle · Accepted Answer

Another way is to grab the resource directly.
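This sidesteps the empty node seen in the question: read_html() only parses the HTML the server sends and does not run the page's JavaScript, so a chart drawn client-side never shows up in it. Here is a minimal sketch using a made-up inline page (the markup and script body are illustrative assumptions, not copied from the FiveThirtyEight site):

```r
library(rvest)

# Hypothetical page: an empty container that a script would fill in a browser.
page <- read_html('
  <html><body>
    <div id="polling-avg-chart"></div>
    <script>/* in a browser, JavaScript would draw the chart into the div */</script>
  </body></html>
')

chart <- html_nodes(page, "#polling-avg-chart")

length(chart)                 # the container itself is found
length(html_children(chart))  # but it has no child elements to scrape
```

The selector matches, yet there is nothing inside the node, which is why the data has to come from somewhere other than the static HTML.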

In your browser, open Developer Tools (F12 in Chrome/Chromium), go to the "Network" tab, refresh the page (F5), and look for a resource that looks like nicely formatted JSON. Once you have found it, copy its link address (right-click on the resource > Copy link address).

library(httr)
library(tidyr)
library(purrr)
library(dplyr)
library(ggplot2)

url <- "https://projects.fivethirtyeight.com/election-2016/national-primary-polls/USA.json"

r <- GET(url)

The full data set is there. The weights are included too, so you can probably recalculate those averages. The data as plotted is in "model":

dat <- 
  jsonlite::fromJSON(content(r, as = "text")) %>% 
  map(purrr::pluck, "model") %>% 
  bind_rows(.id = "party") %>% 
  mutate_all(readr::parse_guess)

# # A tibble: 5,288 x 5
#    party candidate_name state forecastdate poll_avg
#    <chr> <chr>          <chr> <date>          <dbl>
#  1 D     Sanders        USA   2016-07-01       36.5
#  2 D     Clinton        USA   2016-07-01       55.4
#  3 D     Sanders        USA   2016-06-30       37.0
#  4 D     Clinton        USA   2016-06-30       54.6
#  5 D     Sanders        USA   2016-06-29       37.0
#  6 D     Clinton        USA   2016-06-29       54.9
#  7 D     Sanders        USA   2016-06-28       37.2
#  8 D     Clinton        USA   2016-06-28       54.4
#  9 D     Sanders        USA   2016-06-27       37.4
# 10 D     Clinton        USA   2016-06-27       53.9
# # ... with 5,278 more rows
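To make the pipeline above easier to follow without hitting the network, here is a small sketch on a made-up JSON string. The layout (one top-level entry per party, each holding a "model" array whose values are all strings) is an assumption inferred from the code and output above, and the numbers are only illustrative:

```r
library(dplyr)
library(purrr)

# Made-up JSON mimicking the ASSUMED layout of the real resource.
toy_json <- '{
  "D": {"model": [
    {"candidate_name": "Clinton", "forecastdate": "2016-07-01", "poll_avg": "55.4"},
    {"candidate_name": "Sanders", "forecastdate": "2016-07-01", "poll_avg": "36.5"}
  ]},
  "R": {"model": [
    {"candidate_name": "Trump", "forecastdate": "2016-07-01", "poll_avg": "40.0"}
  ]}
}'

toy <- jsonlite::fromJSON(toy_json) %>%  # named list with elements D and R
  map(pluck, "model") %>%                # keep each party's "model" data frame
  bind_rows(.id = "party") %>%           # stack them; list names become `party`
  mutate_all(readr::parse_guess)         # strings -> Date / numeric where possible

str(toy)
```

Each step does one thing: extract the per-party tables, stack them with the party as an id column, then let readr guess the column types.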

Reproduce graphs:

dat %>% 
  filter(candidate_name %in% c("Clinton", "Kasich", "Sanders", "Trump")) %>% 
  ggplot(aes(forecastdate, poll_avg)) +
  geom_line(aes(col = candidate_name)) +
  facet_wrap(~party)

If you'd like interactivity:

library(dygraphs)
library(htmltools)
library(xts)  # xts() is called below and is not attached by dygraphs

foo <- dat %>% 
  filter(candidate_name %in% c("Clinton", "Kasich", "Sanders", "Trump")) %>% 
  split(.$party) %>% 
  map(~ {
    select(.x, forecastdate, candidate_name, poll_avg) %>% 
      spread(candidate_name, poll_avg) %>% 
      {xts(.[-1], .[[1]])} %>%
      dygraph(group = "poll-model") %>% 
      dyRangeSelector()
  })

browsable(tagList(foo))
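The `{xts(.[-1], .[[1]])}` step is the least obvious part, so here is a small sketch of that reshaping on its own. The toy values are taken from the printed rows above; the object names are only illustrative, and as.matrix() is added here to make the conversion explicit:

```r
library(dplyr)
library(tidyr)
library(xts)

# A few rows in the same long layout as `dat`.
toy_long <- tibble(
  forecastdate   = as.Date(c("2016-06-30", "2016-06-30", "2016-07-01", "2016-07-01")),
  candidate_name = c("Clinton", "Sanders", "Clinton", "Sanders"),
  poll_avg       = c(54.6, 37.0, 55.4, 36.5)
)

# Long -> wide: one row per date, one column per candidate.
toy_wide <- spread(toy_long, candidate_name, poll_avg)

# All columns except the first are the data; the first column is the time index.
toy_xts <- xts(as.matrix(toy_wide[-1]), order.by = toy_wide[[1]])
toy_xts
```

The resulting object has one series per candidate indexed by date, which is the shape dygraph() expects.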
