Noah Olsen

Reputation: 281

scraping an interactive table in R with rvest

I'm trying to scrape the scrolling table from the following link: http://proximityone.com/cd114_2013_2014.htm

I'm using rvest but am having trouble finding the correct xpath for the table. My current code is as follows:

library(rvest)

url <- "http://proximityone.com/cd114_2013_2014.htm"
gis_data_html <- read_html(url)
table <- gis_data_html %>%
  html_node(xpath = '//span') %>%
  html_table()

Currently I get the error "no applicable method for 'html_table' applied to an object of class "xml_missing""

Anyone know what I would need to change to scrape the interactive table in the link?

Upvotes: 2

Views: 2383

Answers (1)

Mark

Reputation: 4537

So the problem you're facing is that rvest will read the source of a page, but it won't execute the javascript on the page. When I inspect the interactive table, I see

<textarea id="aw52-box-focus" class="aw-control-focus " tabindex="0" 
onbeforedeactivate="AW(this,event)" onselectstart="AW(this,event)" 
onbeforecopy="AW(this,event)" oncut="AW(this,event)" oncopy="AW(this,event)" 
onpaste="AW(this,event)" style="z-index: 1; width: 100%; height: 100%;">
</textarea>

but when I look at the page source, "aw52-box-focus" doesn't exist. This is because it's created as the page loads via javascript.

You have a couple of options to deal with this. The 'easy' one is to use RSelenium, which drives an actual browser to load the page, and then grab the element after it's rendered. The other option is to read through the javascript, see where it's getting the data from, and then tap into that source directly rather than scraping the table.
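For the first option, a rough sketch of the RSelenium route looks like the following. This assumes you have a Selenium server / chromedriver available locally, so it won't run as-is in every environment; the `Sys.sleep()` is a crude wait for the javascript to finish building the table.

```r
library(RSelenium)

# starts a local Selenium server + browser (requires chromedriver etc.)
rD    <- rsDriver(browser = "chrome")
remDr <- rD$client

remDr$navigate("http://proximityone.com/cd114_2013_2014.htm")
Sys.sleep(5)  # crude: give the javascript time to build the table

# grab the *rendered* page source and hand it back to rvest
rendered <- remDr$getPageSource()[[1]]
tables   <- rvest::html_table(rvest::read_html(rendered))

remDr$close()
rD$server$stop()
```

The key difference from plain rvest is that `getPageSource()` returns the DOM after the javascript has run, so the `aw52-box-focus` element actually exists in what you parse.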

UPDATE

Turns out it's really easy to read the javascript - it's just loading a CSV file. The address is in plain text, http://proximityone.com/countytrends/cd114_acs2014utf8_hl.csv

The .csv doesn't have column headers, but those are in the <script> as well

var columns = [
"FirstNnme",
"LastName",
"Party",
"Feature",
"St",
"CD",
"State<br>CD",
"State<br>CD",
"Population<br>2013", 
"Population<br>2014", 
"PopCh<br>2013-14", 
"%PopCh<br>2013-14", 
"MHI<br>2013", 
"MHI<br>2014", 
"MHI<br>Change<br>2013-14", 
"%MHI<br>Change<br>2013-14", 
"MFI<br>2013", 
"MFI<br>2014", 
"MFI<br>Change<br>2013-14", 
"%MFI<br>Change<br>2013-14", 
"MHV<br>2013", 
"MHV<br>2014", 
"MHV<br>Change<br>2013-14", 
"%MHV<br>Change<br>2013-14", 

]
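Since the CSV has no headers, you can attach these names yourself after reading it in. A small sketch, just cleaning the `<br>` HTML line breaks out of the list above; note the array repeats "State<br>CD", so `make.unique()` avoids a name collision (I've only copied the first few names here):

```r
# First few names copied from the page's <script>; "<br>" is an HTML
# line break, so swap it for a space to get usable headers
columns <- c("FirstNnme", "LastName", "Party", "Feature", "St", "CD",
             "State<br>CD", "State<br>CD",
             "Population<br>2013", "Population<br>2014")  # ...and so on

clean <- make.unique(gsub("<br>", " ", columns))
# clean[7:8] is now "State CD" and "State CD.1"

# data <- read.csv("http://proximityone.com/countytrends/cd114_acs2014utf8_hl.csv",
#                  header = FALSE)
# names(data)[seq_along(clean)] <- clean
```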

Programmatic Solution

Instead of digging through the javascript by hand (in case there are several such pages on this site you want), you can attempt this programmatically too. We read the page, get the <script> nodes, get their text (the scripts themselves), and look for references to a csv file. Then we expand the relative URL and read it in. This doesn't give you the column names, but it shouldn't be too hard to extract those too.

library(rvest)

page <- read_html("http://proximityone.com/cd114_2013_2014.htm")

# find the <script> that mentions a .csv file
scripts <- page %>%
  html_nodes("script") %>%
  html_text() %>%
  grep("\\.csv", ., value = TRUE)

# pull out the relative path ("../...csv") and expand it to a full URL
relCSV  <- stringr::str_extract(scripts, "\\.\\./.*?csv")
fullCSV <- gsub("\\.\\.", "http://proximityone.com", relCSV)

data <- read.csv(fullCSV, header = FALSE)
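If you want the column names programmatically as well, the same idea works: the `columns` array sits in the script text, so a regex can pull the quoted strings out of it. A rough sketch, demonstrated here on a trimmed sample string in the shape the page uses (for the live page you'd run it on the `scripts` text from above; assumes the array literal looks like the one quoted earlier):

```r
# Sample script text in the shape the page uses (trimmed; the real
# array is longer) -- this just demonstrates the regex
script <- 'var columns = [ "FirstNnme", "LastName", "State<br>CD", "State<br>CD", ]'

# everything between "var columns = [" and the closing "]"
block <- stringr::str_extract(script, "var columns\\s*=\\s*\\[[^\\]]*\\]")

# each quoted string inside the array, then clean the <br> breaks
cols <- stringr::str_match_all(block, '"([^"]*)"')[[1]][, 2]
cols <- make.unique(gsub("<br>", " ", cols))
# cols is now c("FirstNnme", "LastName", "State CD", "State CD.1")
```

Then `names(data) <- cols` (assuming the lengths match) finishes the job.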

Upvotes: 6
