NBE

Reputation: 651

web scraping multiple web pages with different directory strings in r with rvest

I know there are a lot of questions similar to this one, but I haven't found one that asks exactly this (please forgive me if I am wrong). I am trying to scrape a website for weather data, and I succeeded for one of its pages. Now I would like to loop the process over several pages. I have looked at two related posts,

but I don't believe they solve my problem.

The URL changes slightly at the end (the elem parameter), from http://climate.rutgers.edu/stateclim_v1/nclimdiv/index.php?stn=NJ00&elem=avgt to

  http://climate.rutgers.edu/stateclim_v1/nclimdiv/index.php?stn=NJ00&elem=pcpn

and so on. How can I loop through them even though the changing part is not an increasing number?

Code:

library(rvest)
library(dplyr)

### Read the web page ###
nj_weather_data <- read_html("http://climate.rutgers.edu/stateclim_v1/nclimdiv/")
### Get the info you want from the web page ###
hurr <- html_nodes(nj_weather_data, "#climdiv_table")
### Extract the info and turn it into a data frame ###
precip_table <- as.data.frame(html_table(hurr)) %>%
  select(-Rank)

Upvotes: 0

Views: 226

Answers (1)

Roman Luštrik

Reputation: 70643

Assuming you want average temperature, minimum temperature, precipitation and so on: look at how the URL changes when you click any of the entries in the table above the temperature table. The switching is done through JavaScript, and to reproduce those clicks you would have to load the page in some sort of (headless) browser such as PhantomJS.

Another way is to take the element name for each individual page, append it to the URL, and load the data directly.
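To make "append it to the URL" concrete, here is a minimal sketch that only builds the addresses and does not touch the network. The names base_url, elems and urls are my own, chosen for illustration, and the two element codes are the ones from the URLs in the question.

```r
# Everything up to and including the query parameter that changes.
base_url <- "http://climate.rutgers.edu/stateclim_v1/nclimdiv/index.php?stn=NJ00&elem="

# Element codes taken from the two URLs in the question.
elems <- c("avgt", "pcpn")

# paste0() is vectorised, so this yields one URL per element code.
urls <- paste0(base_url, elems)
names(urls) <- elems

print(urls[["pcpn"]])
```

Because the codes are plain strings, it does not matter that they are not consecutive numbers; any character vector of names can be looped over the same way.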

library(rvest)

# notice the %s at the end - this is replaced by elements of cs in sprintf
# statement below
x <- "http://climate.rutgers.edu/stateclim_v1/nclimdiv/index.php?stn=NJ00&elem=%s"
cs <- c("mint", "avgt", "pcpn", "hdd", "cdd")

# you could paste together new url using paste, too
customstat <- sprintf(x, cs) # %s is replaced with mint, avgt...

# prepare empty object for results
out <- vector("list", length(customstat))
names(out) <- cs

# get each table and insert it into the output, indexing by position
for (i in seq_along(customstat)) {
  out[[i]] <- read_html(customstat[i]) %>%
    html_nodes("#climdiv_table") %>%
    html_table() %>%
    .[[1]]
}

> str(out)
List of 5
 $ mint:'data.frame':   131 obs. of  15 variables:
  ..$ Rank  : logi [1:131] NA NA NA NA NA NA ...
  ..$ Year  : chr [1:131] "1895" "1896" "1897" "1898" ...
  ..$ Jan   : chr [1:131] "18.1" "18.6" "18.7" "23.2" ...
  ..$ Feb   : chr [1:131] "11.7" "20.7" "22.5" "22.1" ...

You can now glue the tables together (e.g. using do.call(rbind, out)) or do whatever else is required for your analysis.
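As a rough sketch of that gluing step, the example below uses two tiny made-up tables in place of the scraped ones, so it runs offline. out_demo, tagged and combined are illustrative names and the numbers are invented. It assumes all tables share the same columns, and it adds an elem column so each row can still be traced to its source table after rbind.

```r
# Toy stand-ins for two scraped tables (values are made up for illustration).
out_demo <- list(
  mint = data.frame(Year = c("1895", "1896"), Jan = c("18.1", "18.6"),
                    stringsAsFactors = FALSE),
  pcpn = data.frame(Year = c("1895", "1896"), Jan = c("3.52", "1.49"),
                    stringsAsFactors = FALSE)
)

# Tag each table with the element it came from before stacking.
tagged <- Map(
  function(tbl, elem) cbind(elem = elem, tbl, stringsAsFactors = FALSE),
  out_demo, names(out_demo)
)

# Stack the tables row-wise and drop the auto-generated row names.
combined <- do.call(rbind, tagged)
rownames(combined) <- NULL

# The str() output above shows character columns, so convert before analysis.
combined$Year <- as.integer(combined$Year)
combined$Jan <- as.numeric(combined$Jan)

str(combined)
```

Whether a long table like this or a list of separate tables is more convenient depends on the analysis; the list form keeps each element apart, while the stacked form is easier to filter and plot.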

Upvotes: 1
