Reputation: 714
I'm just getting my feet wet with extracting data from a website with R. The EIA has a webpage that provides interactive access to their data, and I would like to extract the range of years for which data is available.
I would like to extract the values for the options, but only for a specific select element (named "year1") on the webpage. How can I do this?
<span id="sub3">
<label for="year">Start Year:</label>
<select name="year1" id="year" style="font-size:12px;padding:4px 2px;border:solid 1px #aacfe4;" onchange="activeB()">
<option value="2012">2012</option>
<option value="2011">2011</option>
<option value="2010">2010</option>
<option value="2009">2009</option>
<option value="2008" selected="selected">2008</option>
<option value="2007">2007</option>
<option value="2006">2006</option>
<option value="2005">2005</option>
<option value="2004">2004</option>
<option value="2003">2003</option>
<option value="2002">2002</option>
<option value="2001">2001</option>
<option value="2000">2000</option>
<option value="1999">1999</option>
<option value="1998">1998</option>
<option value="1997">1997</option>
<option value="1996">1996</option>
<option value="1995">1995</option>
<option value="1994">1994</option>
<option value="1993">1993</option>
<option value="1992">1992</option>
<option value="1991">1991</option>
<option value="1990">1990</option>
<option value="1989">1989</option>
<option value="1988">1988</option>
<option value="1987">1987</option>
<option value="1986">1986</option>
<option value="1985">1985</option>
<option value="1984">1984</option>
<option value="1983">1983</option>
<option value="1982">1982</option>
<option value="1981">1981</option>
<option value="1980">1980</option>
</select>
</span>
I've gotten as far as downloading the page and extracting all option values on the page, but am stuck with trying to extract only those options within the "year1" select element.
library(XML)
webpage <- readLines("http://www.eia.gov/cfapps/ipdbproject/IEDIndex3.cfm?tid=2&pid=2&aid=12")
htmlpage <- htmlParse(webpage, asText = TRUE)
pageoptions <- xpathSApply(htmlpage, "//option", function(u) xmlAttrs(u)["value"])
Which gives:
head(pageoptions)
value value value value value value
"regions" "2012" "2011" "2010" "2009" "2008"
As you can see, another select
list included.
So, how do I get just those 2008 - 2012 values, assuming that the page structure remains constant but the date ranges available may change over time?
Thank you.
Edit
The accepted answer works with the following code:
year <- c(NA_integer_, NA_integer_)
startline <- grep(pattern = "XMLinclude.*syid=", x = webpage, value = FALSE)
year[1] <- sub(pattern = "^.*syid=(.*)&eyid.*", replacement = "\\1", x = webpage[startline])
year[2] <- sub(pattern = "^.*eyid=(.*)&form.*", replacement = "\\1", x = webpage[startline])
Profiling, there's a big difference in memory allocation, where xml_func is jdharrison's solution, url_func is hvollmeier's solution and noxml_func is a third solution that I thought of while sleeping on the problem (using grep to find the start of the select
control and then a while loop to iterate through the option values until the end of select
is found and pulling out values using gsub
):
time alloc release dups ref src
1 0.001 0.392 0 0 .active-rstudio-document#7 wrapper_func/noxml_func
2 0.019 13.853 0 12332 .active-rstudio-document#8 wrapper_func/xml_func
3 0.001 0.000 0 129 .active-rstudio-document#9 wrapper_func/url_func
Upvotes: 2
Views: 1896
Reputation: 2986
@Tom, even better and much more stable, instead of scraping the page download the data as a excel file and do whatever you want :-). ( see the excel link on the page? when you inspect the element you can figure out the url of the excel xls-file )
url="http://www.eia.gov/cfapps/ipdbproject/XMLinclude_3.cfm?tid=2&pid=2&pdid=&aid=12&cid=regions&syid=2008&eyid=2012&form=&defaultid=3&typeOfUnit=STDUNIT&unit=BKWH&products="
download the file and save it:
download.file(url,"eiafile.xls")
Upvotes: 2
Reputation: 30425
Include an additional filter on span[@id='sub3']
to narrow the search down
library(XML)
webpage <- readLines("http://www.eia.gov/cfapps/ipdbproject/IEDIndex3.cfm?tid=2&pid=2&aid=12")
htmlpage <- htmlParse(webpage, asText = TRUE)
pageoptions <- xpathSApply(htmlpage, "//span[@id='sub3']/*/option", function(u) xmlAttrs(u)["value"])
> pageoptions
value value value value value value value value value value
"2012" "2011" "2010" "2009" "2008" "2007" "2006" "2005" "2004" "2003"
value value value value value value value value value value
"2002" "2001" "2000" "1999" "1998" "1997" "1996" "1995" "1994" "1993"
value value value value value value value value value value
"1992" "1991" "1990" "1989" "1988" "1987" "1986" "1985" "1984" "1983"
value value value
"1982" "1981" "1980"
"//select[@name='year1']/option"
as your xpath would also work
Upvotes: 5