Tom

Reputation: 714

Extract data from HTML page in R: option values in specific select elements

I'm just getting my feet wet with extracting data from a website with R. The EIA has a webpage that provides interactive access to their data, and I would like to extract the range of years for which data is available.

I would like to extract the values for the options, but only for a specific select element (named "year1") on the webpage. How can I do this?

<span id="sub3">
  <label for="year">Start Year:</label>
  <select name="year1" id="year" style="font-size:12px;padding:4px 2px;border:solid 1px #aacfe4;" onchange="activeB()">
    <option value="2012">2012</option>
    <option value="2011">2011</option>
    <option value="2010">2010</option>
    <option value="2009">2009</option>
    <option value="2008" selected="selected">2008</option>
    <option value="2007">2007</option>
    <option value="2006">2006</option>
    <option value="2005">2005</option>
    <option value="2004">2004</option>
    <option value="2003">2003</option>
    <option value="2002">2002</option>
    <option value="2001">2001</option>
    <option value="2000">2000</option>
    <option value="1999">1999</option>
    <option value="1998">1998</option>
    <option value="1997">1997</option>
    <option value="1996">1996</option>
    <option value="1995">1995</option>
    <option value="1994">1994</option>
    <option value="1993">1993</option>
    <option value="1992">1992</option>
    <option value="1991">1991</option>
    <option value="1990">1990</option>
    <option value="1989">1989</option>
    <option value="1988">1988</option>
    <option value="1987">1987</option>
    <option value="1986">1986</option>
    <option value="1985">1985</option>
    <option value="1984">1984</option>
    <option value="1983">1983</option>
    <option value="1982">1982</option>
    <option value="1981">1981</option>
    <option value="1980">1980</option>                                              
  </select>
</span>

I've gotten as far as downloading the page and extracting all option values on the page, but am stuck with trying to extract only those options within the "year1" select element.

library(XML)
webpage <- readLines("http://www.eia.gov/cfapps/ipdbproject/IEDIndex3.cfm?tid=2&pid=2&aid=12")
htmlpage <- htmlParse(webpage, asText = TRUE)
pageoptions <- xpathSApply(htmlpage, "//option", function(u) xmlAttrs(u)["value"])

Which gives:

head(pageoptions)

value     value     value     value     value     value 
"regions"    "2012"    "2011"    "2010"    "2009"    "2008" 

As you can see, options from another select element on the page are included as well.

So, how do I get just those 2008 - 2012 values, assuming that the page structure remains constant but the date ranges available may change over time?

Thank you.

Edit

The accepted answer works with the following code:

year <- c(NA_integer_, NA_integer_)
# find the line containing the XMLinclude URL, whose query string carries syid/eyid
startline <- grep(pattern = "XMLinclude.*syid=", x = webpage, value = FALSE)
# pull the start and end years out of that query string and keep them as integers
year[1] <- as.integer(sub(pattern = "^.*syid=(.*)&eyid.*", replacement = "\\1", x = webpage[startline]))
year[2] <- as.integer(sub(pattern = "^.*eyid=(.*)&form.*", replacement = "\\1", x = webpage[startline]))

Profiling shows a big difference in memory allocation. Below, xml_func is jdharrison's solution, url_func is hvollmeier's solution, and noxml_func is a third solution I thought of while sleeping on the problem: use grep to find the start of the select control, then step through the option lines in a while loop until the end of the select is reached, pulling the values out with gsub (a sketch of this is shown after the table):

   time  alloc release  dups                        ref                     src
1 0.001  0.392       0     0 .active-rstudio-document#7 wrapper_func/noxml_func
2 0.019 13.853       0 12332 .active-rstudio-document#8 wrapper_func/xml_func  
3 0.001  0.000       0   129 .active-rstudio-document#9 wrapper_func/url_func  
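
For reference, a minimal sketch of what that noxml_func approach could look like. The function name, the fixed-string patterns, and the assumption that every <option> sits on its own line (as in the snippet above) are mine, not taken from the original code:

noxml_func <- function(webpage) {
  # line where the year1 select control starts
  i <- grep('<select name="year1"', webpage, fixed = TRUE)[1] + 1
  years <- character(0)
  # walk through the following lines until the closing </select> tag
  while (i <= length(webpage) && !grepl("</select>", webpage[i], fixed = TRUE)) {
    if (grepl("<option value=", webpage[i], fixed = TRUE)) {
      # strip everything except the value attribute
      years <- c(years, gsub('^.*<option value="([^"]+)".*$', "\\1", webpage[i]))
    }
    i <- i + 1
  }
  years
}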

Upvotes: 2

Views: 1896

Answers (2)

hvollmeier

Reputation: 2986

@Tom, even better and much more stable: instead of scraping the page, download the data as an Excel file and do whatever you want with it :-). (See the Excel link on the page? When you inspect that element you can figure out the URL of the .xls file.)

url="http://www.eia.gov/cfapps/ipdbproject/XMLinclude_3.cfm?tid=2&pid=2&pdid=&aid=12&cid=regions&syid=2008&eyid=2012&form=&defaultid=3&typeOfUnit=STDUNIT&unit=BKWH&products="

Download the file and save it:

download.file(url, "eiafile.xls", mode = "wb")  # binary mode so the .xls isn't corrupted on Windows
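
From there you can read the file straight into R. A minimal sketch, assuming the download really is a regular .xls workbook and that the readxl package is available (readxl is my suggestion, not part of the original answer):

library(readxl)
eia <- read_excel("eiafile.xls")  # reads the first sheet by default
head(eia)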

Upvotes: 2

jdharrison

Reputation: 30425

Include an additional filter on span[@id='sub3'] to narrow the search down:

library(XML)
webpage <- readLines("http://www.eia.gov/cfapps/ipdbproject/IEDIndex3.cfm?tid=2&pid=2&aid=12")
htmlpage <- htmlParse(webpage, asText = TRUE)
pageoptions <- xpathSApply(htmlpage, "//span[@id='sub3']/*/option", function(u) xmlAttrs(u)["value"])

> pageoptions
value  value  value  value  value  value  value  value  value  value 
"2012" "2011" "2010" "2009" "2008" "2007" "2006" "2005" "2004" "2003" 
value  value  value  value  value  value  value  value  value  value 
"2002" "2001" "2000" "1999" "1998" "1997" "1996" "1995" "1994" "1993" 
value  value  value  value  value  value  value  value  value  value 
"1992" "1991" "1990" "1989" "1988" "1987" "1986" "1985" "1984" "1983" 
value  value  value 
"1982" "1981" "1980" 

"//select[@name='year1']/option" as your xpath would also work

Upvotes: 5
