Reputation: 13
I'd like to make a list of every election date listed here: https://voterportal.sos.la.gov/static/ so that I can then travel to each respective election site and download and compile the spreadsheets called "Excel - Complete Results".
Normally I'd go about this by using rvest to get every date listed on the linked site, then map over the dates to reach each election site (just the election date appended to the parent site URL, e.g. "https://voterportal.sos.la.gov/static/2022-04-30"), and then read in the Excel files linked on those sites. But I'm running into a problem with html_elements() that I haven't encountered before.
I tried to use html_element() to pull the dates:
library(rvest)

la_elections_url <- "https://voterportal.sos.la.gov/static/"
la_elections_text <- read_html(la_elections_url)
la_elections_text %>% html_element("a")
which I thought I'd then be able to filter down to the href attributes like:
html_attr(html_elements(la_elections_text, "a"), "href") %>% as.list()
to get a list of the election dates. But instead of a warning with results, html_element() returns a missing node:
la_elections_text %>% html_element("a")
{xml_missing}
<NA>
Upvotes: 1
Views: 171
Reputation: 7405
This website uses XHR to load its data, which makes DOM-based scraping with rvest a bit trickier: the anchor tags you're looking for aren't in the initial HTML at all. Luckily, you can use your browser's DevTools (Network tab) to grab the URL the page fetches, and request that data yourself.
Using httr, this becomes pretty easy:
library(httr)
library(tidyverse)
res <- httr::GET('https://voterportal.sos.la.gov/ElectionResults/ElectionResults/Data?blob=ElectionDates.htm')
res_list <- httr::content(res)
res_list$Dates$Date %>%
  purrr::map(~ .x$ElectionDate)
Which gives you:
[[1]]
[1] "04/29/2023"
[[2]]
[1] "03/25/2023"
[[3]]
[1] "02/18/2023"
[[4]]
[1] "01/14/2023"
[[5]]
[1] "12/10/2022"
.....
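From here, if you want the per-election URLs the question describes (the parent URL plus a YYYY-MM-DD date), note that the endpoint returns dates as MM/DD/YYYY strings, so they need reformatting first. A minimal sketch using a few of the dates returned above (assuming the static site keeps the YYYY-MM-DD path convention shown in the question):

```r
# Example dates as returned by the endpoint above (MM/DD/YYYY strings).
dates_mdy <- c("04/29/2023", "03/25/2023", "02/18/2023")

# Reformat to YYYY-MM-DD, the form used in the static-site URLs.
dates_iso <- format(as.Date(dates_mdy, format = "%m/%d/%Y"), "%Y-%m-%d")

# Build each election-site URL.
election_urls <- paste0("https://voterportal.sos.la.gov/static/", dates_iso)

election_urls[1]
# "https://voterportal.sos.la.gov/static/2023-04-29"
```

You could then map over election_urls to locate and download each "Excel - Complete Results" link.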
Upvotes: 1