Reputation: 281
I'm trying to scrape the scrolling table from the following link: http://proximityone.com/cd114_2013_2014.htm
I'm using rvest but am having trouble finding the correct xpath for the table. My current code is as follows:
url <- "http://proximityone.com/cd114_2013_2014.htm"
table <- gis_data_html %>%
html_node(xpath = '//span') %>%
html_table()
Currently I get the error "no applicable method for 'html_table' applied to an object of class "xml_missing""
Anyone know what I would need to change to scrape the interactive table in the link?
Upvotes: 2
Views: 2383
Reputation: 4537
So the problem you're facing is that rvest
will read the source of a page, but it won't execute the javascript on the page. When I inspect the interactive table, I see
<textarea id="aw52-box-focus" class="aw-control-focus " tabindex="0"
onbeforedeactivate="AW(this,event)" onselectstart="AW(this,event)"
onbeforecopy="AW(this,event)" oncut="AW(this,event)" oncopy="AW(this,event)"
onpaste="AW(this,event)" style="z-index: 1; width: 100%; height: 100%;">
</textarea>
but when I look at the page source, "aw52-box-focus" doesn't exist. This is because it's created as the page loads via javascript.
You have a couple of options to deal with this. The 'easy' one is to use RSelenium
and drive an actual browser to load the page, then grab the element after it's rendered. The other option is to read through the javascript, see where it's getting the data from, and tap into that directly rather than scraping the table.
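If you go the RSelenium route, a rough sketch looks something like this (the browser, port, and wait time are assumptions; adjust them to your setup):

library(RSelenium)
library(rvest)

rD <- rsDriver(browser = "firefox", port = 4545L)  # start a selenium server + browser
remDr <- rD$client

remDr$navigate("http://proximityone.com/cd114_2013_2014.htm")
Sys.sleep(5)  # give the page's javascript time to build the table

# grab the rendered source, which now contains the table elements
rendered <- read_html(remDr$getPageSource()[[1]])

remDr$close()
rD$server$stop()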
UPDATE
Turns out it's really easy to read the javascript - it's just loading a CSV file. The address is in plain text, http://proximityone.com/countytrends/cd114_acs2014utf8_hl.csv
The .csv doesn't have column headers, but those are in the <script>
as well
var columns = [
"FirstNnme",
"LastName",
"Party",
"Feature",
"St",
"CD",
"State<br>CD",
"State<br>CD",
"Population<br>2013",
"Population<br>2014",
"PopCh<br>2013-14",
"%PopCh<br>2013-14",
"MHI<br>2013",
"MHI<br>2014",
"MHI<br>Change<br>2013-14",
"%MHI<br>Change<br>2013-14",
"MFI<br>2013",
"MFI<br>2014",
"MFI<br>Change<br>2013-14",
"%MFI<br>Change<br>2013-14",
"MHV<br>2013",
"MHV<br>2014",
"MHV<br>Change<br>2013-14",
"%MHV<br>Change<br>2013-14",
]
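If you only need this one file, you can read that CSV straight in and attach the names by hand. A minimal sketch, assuming you've copied the list above into an R vector called col_names (make.unique takes care of the repeated "State&lt;br&gt;CD" entry):

# col_names is assumed to be the javascript list above copied into R, e.g.
# col_names <- c("FirstNnme", "LastName", "Party", ...)
cd_data <- read.csv("http://proximityone.com/countytrends/cd114_acs2014utf8_hl.csv",
                    header = FALSE, stringsAsFactors = FALSE)

# strip the <br> display markup and attach the names if the count matches
if (length(col_names) == ncol(cd_data)) {
  names(cd_data) <- make.unique(gsub("<br>", " ", col_names))
}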
Programmatic Solution
Instead of digging through the javascript by hand (in case there are several such pages on this website you want), you can attempt this programmatically too. We read the page, get the <script>
nodes, get their text (the scripts themselves), and look for references to a csv file. Then we expand the relative URL and read it in. This doesn't help with column names, but those aren't hard to extract either (see the sketch after the code).
library(rvest)

page = read_html("http://proximityone.com/cd114_2013_2014.htm")

# pull the text of every <script> node and keep the one(s) that reference a .csv
scripts = page %>%
  html_nodes("script") %>%
  html_text() %>%
  grep("\\.csv", ., value = TRUE)

# extract the relative ../ path to the csv and expand it into a full URL
relCSV = stringr::str_extract(scripts, "\\.\\./.*?csv")
fullCSV = gsub("\\.\\.", "http://proximityone.com", relCSV)

# the file has no header row, so read it without one
data = read.csv(fullCSV, header = FALSE)
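To pick the column names up programmatically as well, you can mine the same script text. This is only a sketch and assumes a single matching <script> with a declaration that looks like the var columns = [...] block quoted above:

# pull the quoted names out of the "var columns = [...]" declaration
col_block <- stringr::str_extract(scripts[1], "var columns = \\[[^\\]]*\\]")
col_names <- gsub('"', "", stringr::str_extract_all(col_block, '"[^"]*"')[[1]])

# strip the <br> markup and attach the names if the count matches the data
if (!is.na(col_block) && length(col_names) == ncol(data)) {
  names(data) <- make.unique(gsub("<br>", " ", col_names))
}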
Upvotes: 6