Tom

Reputation: 714

Extract data from HTML page in R: option values in specific select elements

I'm just getting my feet wet with extracting data from a website with R. The EIA has a webpage that provides interactive access to their data, and I would like to extract the range of years for which data is available.

I would like to extract the values for the options, but only for a specific select element (named "year1") on the webpage. How can I do this?

<span id="sub3">
  <label for="year">Start Year:</label>
  <select name="year1" id="year" style="font-size:12px;padding:4px 2px;border:solid 1px #aacfe4;" onchange="activeB()">
    <option value="2012">2012</option>
    <option value="2011">2011</option>
    <option value="2010">2010</option>
    <option value="2009">2009</option>
    <option value="2008" selected="selected">2008</option>
    <option value="2007">2007</option>
    <option value="2006">2006</option>
    <option value="2005">2005</option>
    <option value="2004">2004</option>
    <option value="2003">2003</option>
    <option value="2002">2002</option>
    <option value="2001">2001</option>
    <option value="2000">2000</option>
    <option value="1999">1999</option>
    <option value="1998">1998</option>
    <option value="1997">1997</option>
    <option value="1996">1996</option>
    <option value="1995">1995</option>
    <option value="1994">1994</option>
    <option value="1993">1993</option>
    <option value="1992">1992</option>
    <option value="1991">1991</option>
    <option value="1990">1990</option>
    <option value="1989">1989</option>
    <option value="1988">1988</option>
    <option value="1987">1987</option>
    <option value="1986">1986</option>
    <option value="1985">1985</option>
    <option value="1984">1984</option>
    <option value="1983">1983</option>
    <option value="1982">1982</option>
    <option value="1981">1981</option>
    <option value="1980">1980</option>                                              
  </select>
</span>

I've gotten as far as downloading the page and extracting all option values on the page, but am stuck with trying to extract only those options within the "year1" select element.

library(XML)
webpage <- readLines("http://www.eia.gov/cfapps/ipdbproject/IEDIndex3.cfm?tid=2&pid=2&aid=12")
htmlpage <- htmlParse(webpage, asText = TRUE)
pageoptions <- xpathSApply(htmlpage, "//option", function(u) xmlAttrs(u)["value"])

Which gives:

head(pageoptions)

value     value     value     value     value     value 
"regions"    "2012"    "2011"    "2010"    "2009"    "2008" 

As you can see, options from another select element on the page are included as well.

So, how do I get just those 2008 - 2012 values, assuming that the page structure remains constant but the date ranges available may change over time?

Thank you.

Edit

The accepted answer works with the following code:

year <- c(NA_integer_, NA_integer_)
# find the line containing the XMLinclude URL, whose query string carries syid/eyid
startline <- grep(pattern = "XMLinclude.*syid=", x = webpage, value = FALSE)
# pull the start and end years out of that query string and keep them as integers
year[1] <- as.integer(sub(pattern = "^.*syid=(.*)&eyid.*", replacement = "\\1", x = webpage[startline]))
year[2] <- as.integer(sub(pattern = "^.*eyid=(.*)&form.*", replacement = "\\1", x = webpage[startline]))

Profiling shows a big difference in memory allocation. Below, xml_func is jdharrison's solution, url_func is hvollmeier's solution, and noxml_func is a third solution I thought of while sleeping on the problem: use grep to find the start of the select control, then step through the option lines in a while loop until the end of the select is reached, pulling the values out with gsub (a sketch of this is shown after the table):

   time  alloc release  dups                        ref                     src
1 0.001  0.392       0     0 .active-rstudio-document#7 wrapper_func/noxml_func
2 0.019 13.853       0 12332 .active-rstudio-document#8 wrapper_func/xml_func  
3 0.001  0.000       0   129 .active-rstudio-document#9 wrapper_func/url_func  
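
For reference, a minimal sketch of what that noxml_func approach could look like. The function name, the fixed-string patterns, and the assumption that every <option> sits on its own line (as in the snippet above) are mine, not taken from the original code:

noxml_func <- function(webpage) {
  # line where the year1 select control starts
  i <- grep('<select name="year1"', webpage, fixed = TRUE)[1] + 1
  years <- character(0)
  # walk through the following lines until the closing </select> tag
  while (i <= length(webpage) && !grepl("</select>", webpage[i], fixed = TRUE)) {
    if (grepl("<option value=", webpage[i], fixed = TRUE)) {
      # strip everything except the value attribute
      years <- c(years, gsub('^.*<option value="([^"]+)".*$', "\\1", webpage[i]))
    }
    i <- i + 1
  }
  years
}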

Upvotes: 2

Views: 1896

Answers (2)

hvollmeier

Reputation: 2986

@Tom, even better and much more stable: instead of scraping the page, download the data as an Excel file and do whatever you want with it :-). (See the Excel link on the page? When you inspect that element you can figure out the URL of the .xls file.)

url="http://www.eia.gov/cfapps/ipdbproject/XMLinclude_3.cfm?tid=2&pid=2&pdid=&aid=12&cid=regions&syid=2008&eyid=2012&form=&defaultid=3&typeOfUnit=STDUNIT&unit=BKWH&products="

Download the file and save it:

download.file(url, "eiafile.xls", mode = "wb")  # binary mode so the .xls isn't corrupted on Windows
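
From there you can read the file straight into R. A minimal sketch, assuming the download really is a regular .xls workbook and that the readxl package is available (readxl is my suggestion, not part of the original answer):

library(readxl)
eia <- read_excel("eiafile.xls")  # reads the first sheet by default
head(eia)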

Upvotes: 2

jdharrison

Reputation: 30425

Include an additional filter on span[@id='sub3'] to narrow the search down:

library(XML)
webpage <- readLines("http://www.eia.gov/cfapps/ipdbproject/IEDIndex3.cfm?tid=2&pid=2&aid=12")
htmlpage <- htmlParse(webpage, asText = TRUE)
pageoptions <- xpathSApply(htmlpage, "//span[@id='sub3']/*/option", function(u) xmlAttrs(u)["value"])

> pageoptions
value  value  value  value  value  value  value  value  value  value 
"2012" "2011" "2010" "2009" "2008" "2007" "2006" "2005" "2004" "2003" 
value  value  value  value  value  value  value  value  value  value 
"2002" "2001" "2000" "1999" "1998" "1997" "1996" "1995" "1994" "1993" 
value  value  value  value  value  value  value  value  value  value 
"1992" "1991" "1990" "1989" "1988" "1987" "1986" "1985" "1984" "1983" 
value  value  value 
"1982" "1981" "1980" 

"//select[@name='year1']/option" as your xpath would also work

Upvotes: 5
