piotr

Reputation: 152

If error then NA in R

I am downloading weather data from the web. To do so I have created a simple for loop which adds data frames with the data into lists (one list per city). It works fine, but if there is no data (no table with the weather conditions for a particular date) it returns an error - for example for this URL ("https://www.wunderground.com/history/airport/EPLB/2015/12/25/DailyHistory.html?req_city=Abramowice%20Koscielne&req_statename=Poland").

Error in Lublin[i] <- url4 %>% read_html() %>% html_nodes(xpath = "//*[@id=\"obsTable\"]") %>%  : 
  replacement has length zero

How can I add an if statement (or other error handling) that returns a row of NA's (13 observations) when the error happens and puts it into the list?

Also, is there a faster way to download all the data than a for loop?

My code:

library(rvest)  # for read_html(), html_nodes(), html_table() and %>%

c <- seq(as.Date("2015/1/1"), as.Date("2016/12/31"), "days")
Warszawa <- list()
Wroclaw <- list()
Bydgoszcz <- list()
Lublin <- list()
Gorzow <- list()
Lodz <- list()
Krakow <- list()
Opole <- list()
Rzeszow <- list()
Bialystok <- list()
Gdansk <- list()
Katowice <- list()
Kielce <- list()
Olsztyn <- list()
Poznan <- list()
Szczecin <- list()
date <- list()
for(i in 1:length(c)) {
y<-as.numeric(format(c[i],'%Y'))
m<-as.numeric(format(c[i],'%m'))
d<-as.numeric(format(c[i],'%d'))
date[i] <- c[i]
url1 <- sprintf("https://www.wunderground.com/history/airport/EPWA/%d/%d/%d/DailyHistory.html?req_city=Warszawa&req_state=MZ&req_statename=Poland", y, m, d)
url2 <- sprintf("https://www.wunderground.com/history/airport/EPWR/%d/%d/%d/DailyHistory.html?req_city=Wrocław&req_statename=Poland", y, m, d)
url3 <- sprintf("https://www.wunderground.com/history/airport/EPBY/%d/%d/%d/DailyHistory.html?req_city=Bydgoszcz&req_statename=Poland", y, m, d)
url4 <- sprintf("https://www.wunderground.com/history/airport/EPLB/%d/%d/%d/DailyHistory.html?req_city=Abramowice%%20Koscielne&req_statename=Poland", y, m, d)
url5 <- sprintf("https://www.wunderground.com/history/airport/EPZG/%d/%d/%d/DailyHistory.html?req_city=Gorzow%%20Wielkopolski&req_statename=Poland", y, m, d)
url6 <- sprintf("https://www.wunderground.com/history/airport/EPLL/%d/%d/%d/DailyHistory.html?req_city=Lodz&req_statename=Poland", y, m, d)
url7 <- sprintf("https://www.wunderground.com/history/airport/EPKK/%d/%d/%d/DailyHistory.html?req_city=Krakow&req_statename=Poland", y, m, d)
url8 <- sprintf("https://www.wunderground.com/history/airport/EPWR/%d/%d/%d/DailyHistory.html?req_city=Opole&req_statename=Poland", y, m, d)
url9 <- sprintf("https://www.wunderground.com/history/airport/EPRZ/%d/%d/%d/DailyHistory.html?req_city=Rzeszow&req_statename=Poland", y, m, d)
url10 <- sprintf("https://www.wunderground.com/history/airport/UMMG/%d/%d/%d/DailyHistory.html?req_city=Dojlidy&req_statename=Poland", y, m, d)
url11 <- sprintf("https://www.wunderground.com/history/airport/EPGD/%d/%d/%d/DailyHistory.html?req_city=Gdansk&req_statename=Poland", y, m, d)
url12 <- sprintf("https://www.wunderground.com/history/airport/EPKM/%d/%d/%d/DailyHistory.html?req_city=Katowice&req_statename=Poland", y, m, d)
url13 <- sprintf("https://www.wunderground.com/history/airport/EPKT/%d/%d/%d/DailyHistory.html?req_city=Chorzow%%20Batory&req_statename=Poland", y, m, d)
url14 <- sprintf("https://www.wunderground.com/history/airport/EPSY/%d/%d/%d/DailyHistory.html", y, m, d)
url15 <- sprintf("https://www.wunderground.com/history/airport/EPPO/%d/%d/%d/DailyHistory.html?req_city=Poznan%%20Old%%20Town&req_statename=Poland", y, m, d)
url16 <- sprintf("https://www.wunderground.com/history/airport/EPSC/%d/%d/%d/DailyHistory.html?req_city=Szczecin&req_statename=Poland", y, m, d)

Warszawa[i] <- url1 %>%
  read_html() %>%
  html_nodes(xpath='//*[@id="obsTable"]') %>%
  html_table()
Wroclaw[i] <- url2 %>%
  read_html() %>%
  html_nodes(xpath='//*[@id="obsTable"]') %>%
  html_table()
Bydgoszcz[i] <- url3 %>%
  read_html() %>%
  html_nodes(xpath='//*[@id="obsTable"]') %>%
  html_table()
Lublin[i] <- url4 %>%
  read_html() %>%
  html_nodes(xpath='//*[@id="obsTable"]') %>%
  html_table()
Gorzow[i] <- url5 %>%
  read_html() %>%
  html_nodes(xpath='//*[@id="obsTable"]') %>%
  html_table()
Lodz[i] <- url6 %>%
  read_html() %>%
  html_nodes(xpath='//*[@id="obsTable"]') %>%
  html_table()
Krakow[i] <- url7 %>%
  read_html() %>%
  html_nodes(xpath='//*[@id="obsTable"]') %>%
  html_table()
Opole[i] <- url8 %>%
  read_html() %>%
  html_nodes(xpath='//*[@id="obsTable"]') %>%
  html_table()
Rzeszow[i] <- url9 %>%
  read_html() %>%
  html_nodes(xpath='//*[@id="obsTable"]') %>%
  html_table()
Bialystok[i] <- url10 %>%
  read_html() %>%
  html_nodes(xpath='//*[@id="obsTable"]') %>%
  html_table()
Gdansk[i] <- url11 %>%
  read_html() %>%
  html_nodes(xpath='//*[@id="obsTable"]') %>%
  html_table()
Katowice[i] <- url12 %>%
  read_html() %>%
  html_nodes(xpath='//*[@id="obsTable"]') %>%
  html_table()
Kielce[i] <- url13 %>%
  read_html() %>%
  html_nodes(xpath='//*[@id="obsTable"]') %>%
  html_table()
Olsztyn[i] <- url14 %>%
  read_html() %>%
  html_nodes(xpath='//*[@id="obsTable"]') %>%
  html_table()
Poznan[i] <- url15 %>%
  read_html() %>%
  html_nodes(xpath='//*[@id="obsTable"]') %>%
  html_table()
Szczecin[i] <- url16 %>%
  read_html() %>%
  html_nodes(xpath='//*[@id="obsTable"]') %>%
  html_table()

}

Thanks for the help.

Upvotes: 1

Views: 1541

Answers (2)

KenHBS

Reputation: 7164

First off, I got a little carried away, so the answer is a bit longer than initially planned. I decided to help you out with three issues: the repetition in building the valid URLs, the repetition in extracting the relevant info from those URLs, and the error problem while scraping.

So here we go. First, build the links you want to scrape in a simpler manner:

library(httr)
library(rvest)

## All the dates:
 dates <- seq(as.Date("2015/1/1"), as.Date("2016/12/31"), "days")
 dates <- gsub("-", "/", x = dates)

## All the regions and links:
 abbreviations <- c("EPWA", "EPWR", "EPBY", "EPLB", "EPZG", "EPLL", "EPKK",          
                      "EPWR", "EPRZ", "UMMG", "EPGD", "EPKM", "EPKT",
                      "EPSY", "EPPO", "EPSC")

links <- paste0("https://www.wunderground.com/history/airport/", 
                abbreviations, "/")
links <- lapply(links, function(x){paste0(x, dates, "/DailyHistory.html")})
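
As a quick sanity check, links should now be a list of 16 character vectors, one per airport code, each holding one URL per day:

 length(links)        # 16, one element per airport code
 length(links[[1]])   # 731, all days from 2015-01-01 to 2016-12-31
 links[[1]][1]
 # [1] "https://www.wunderground.com/history/airport/EPWA/2015/01/01/DailyHistory.html"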

Now that we have all the links in links, we define a function that checks each link, scrapes the HTML and extracts whatever info we want. In your case, that is: the city name, the date and the weather table. I decided to use the city name and date as the name of the list element, so you can easily tell which weather table belongs to which city and date:

## Get the weather report & name 
get_table <- function(link){
  # Get the html from a link
   html <- try(link %>%
             read_html())
   if("try-error)" %in% class(html)){
         print("HTML not found, skipping to next link")
         return("HTML not found, skipping to next link")
   }

   # Get the weather table from that page
   weather_table <- html %>%
     html_nodes(xpath='//*[@id="obsTable"]') %>%
     html_table()
   if(length(weather_table) == 0){
     print("No weather table available for this day")
     return("No weather table available for this day")
   }

   # Use info from the html to get the city, for naming the list
   region <- html %>%
     html_nodes(xpath = '//*[@id="location"]') %>%
     html_text()
   region <- strsplit(region, "[1-9]")[[1]][1]
   region <- gsub("\n", "",  region)
   region <- gsub("\t\t", "", region)

   # Use info from the html to get the date, and name the list
   which_date <- html %>%
    html_nodes(xpath = '//*[@class="history-date"]') %>%
    html_text()

   city_date <- paste0(region, which_date)

   # Name the output
   names(weather_table) <- city_date

   print(paste0("Just scraped ", city_date))
   return(weather_table)
 }

Running this function should work for all the URLs we identified, including the faulty URL you posted in your question:

# A little test-run, to see if your faulty URL works:
  testlink      <- "https://www.wunderground.com/history/airport/EPLB/2015/12/25/DailyHistory.html?req_city=Abramowice%20Koscielne&req_statename=Poland"
  links[[1]][5] <- testlink
  tested        <- sapply(links[[1]][1:6], get_table, USE.NAMES = FALSE)
  # [1] "Just scraped Warsaw, Poland Thursday, January 1, 2015"
  # [1] "Just scraped Warsaw, Poland Friday, January 2, 2015"
  # [1] "Just scraped Warsaw, Poland Saturday, January 3, 2015"
  # [1] "Just scraped Warsaw, Poland Sunday, January 4, 2015"
  # [1] "No weather table available for this day"
  # [1] "Just scraped Warsaw, Poland Tuesday, January 6, 2015"

It works like a charm, so you can use the following loop to get the Polish weather data:

# For all sublists in links (corresponding to cities)
# scrape all links (corresponding to days)
city <- rep(list(list()), length(abbreviations))
for(i in 1:length(links)){
  city[[i]] <- sapply(links[[i]], get_table, USE.NAMES = FALSE)
}
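
If you would rather have the row of 13 NA's you asked about instead of a message string, a minimal sketch (adjusting the two early returns in get_table, with generic column names) could look like this:

 # A single row of 13 NA's, matching the 13 observations mentioned in the question
 na_row <- as.data.frame(matrix(NA, nrow = 1, ncol = 13))

 # In get_table(), the early returns could then hand back this row instead of a string:
 #   if("try-error" %in% class(html)) return(list(na_row))
 #   if(length(weather_table) == 0)   return(list(na_row))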

Upvotes: 1

ASH

Reputation: 20302

Since all those URLs are essentially the same, with slight and very predictable differences, why not loop through an array, concatenate the pieces together, and run that?

Here is an example of what I'm alluding to.

library(rvest)
library(stringr)

#create a master dataframe to store all of the results
complete <- data.frame()

yearsVector <- c("2010", "2011", "2012", "2013", "2014", "2015")
#position is not needed since all of the info is stored on the page
#positionVector <- c("qb", "rb", "wr", "te", "ol", "dl", "lb", "cb", "s")
positionVector <- c("qb")
for (i in 1:length(yearsVector)) {
    for (j in 1:length(positionVector)) {
        # create a url template 
        URL.base <- "http://www.nfl.com/draft/"
        URL.intermediate <- "/tracker?icampaign=draft-sub_nav_bar-drafteventpage-tracker#dt-tabs:dt-by-position/dt-by-position-input:"
        #create the dataframe with the dynamic values
        URL <- paste0(URL.base, yearsVector[i], URL.intermediate, positionVector[j])
        #print(URL)

        #read the page - store the page to make debugging easier
        page <- read_html(URL)

Upvotes: 1
