Reputation: 71
I'm trying to collect information using the rvest package in R. While collecting the data in a for loop, I found that some of the pages do not exist, so the loop stops with an error: Error in open.connection(x, "rb") : HTTP error 404.
Here is my R code. Pages 15138 and 15140 do have information, whereas 15139 does not. How can I skip 15139 inside the for loop?
library(rvest)
library(dplyr)
library(tidyr)
library(stringr)
library(stringi)
source_url <- "https://go2senkyo.com/local/senkyo/"
senkyo <- data.frame()
for (i in 15138:15140) {
  Sys.sleep(0.5)
  target_page <- paste0(source_url, i)
  recall_html <- read_html(target_page, encoding = "UTF-8")
  prefecture <- recall_html %>%
    html_nodes(xpath='//*[contains(concat( " ", @class, " " ), concat( " ", "column_ttl_small", " " ))]') %>%
    html_text()
  city <- recall_html %>%
    html_nodes(xpath='//*[contains(concat( " ", @class, " " ), concat( " ", "column_ttl", " " ))]') %>%
    html_text()
  city <- trimws(gsub("[\r\n]", "", city))
  senkyo2 <- cbind(prefecture, city)
  senkyo <- rbind(senkyo, senkyo2)
}
I'm looking forward to your answer!
Upvotes: 0
Views: 433
Reputation: 440
You can handle exceptions a few different ways. I'm a noob when it comes to scraping, but here are a few options for your situation.
If you know that you don't want the value 15139, you can remove it from the vector you loop over, like:
for (i in c(15138,15140)) {
This completely ignores 15139 when running your loop.
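If more than one page is known to be missing, the same idea extends to a vector of IDs. A minimal sketch, assuming bad_pages and pages_to_visit (both hypothetical names, not from the question) and that you already know which page numbers return 404:

```r
# Assumption: these page IDs are already known to return HTTP 404.
bad_pages <- c(15139)

# Drop the known-bad IDs from the range before the loop starts.
pages_to_visit <- setdiff(15138:15140, bad_pages)

for (i in pages_to_visit) {
  # The scraping code from the question would go here;
  # printing stands in for it so the sketch runs offline.
  print(i)
}
```

This only helps when the missing pages are known in advance; for pages that fail unexpectedly, the error has to be caught at run time.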
Using a control statement is basically the same thing as tailoring your loop range, but it handles the exception within the loop itself, like:
for (i in 15138:15140) {
  Sys.sleep(0.5)
  # control statement
  if (i == 15139) {
    next # moves to the next iteration of the loop, in this case 15140
  }
  target_page <- paste0(source_url, i) # not run if i == 15139, since the loop skipped to the next iteration
  # ... rest of the loop body, unchanged from the question
}
Error handling is where I get out of my depth, and I constantly refer to the Advanced R book. Essentially, you can wrap potentially buggy code in functions like try(), which insulates your loop from errors, keeps it from breaking, and gives you flexibility about what to do if your code breaks in specific ways.
My usual approach would be to add something to your code like:
# wrap the part of your code that can break in try()
recall_html <- try(read_html(target_page, encoding = "UTF-8"))
# the error message is still printed (unless you set silent = TRUE), but it no longer stops your code
# you'll need to add control flow to keep your loop from breaking at the next function, however
# inherits() is used because a successfully parsed page has more than one class
if (inherits(recall_html, "try-error")) {
  next
} else {
  prefecture <- recall_html %>%
    html_nodes(xpath='//*[contains(concat( " ", @class, " " ), concat( " ", "column_ttl_small", " " ))]') %>%
    html_text()
  # ... rest of the loop body, unchanged from the question
}
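An alternative to try() is tryCatch(), which lets you decide what value to return when an error occurs. The following is only a sketch: safe_read and fake_reader are hypothetical names, and fake_reader is a stand-in for read_html() that fails on page 15139 so the example runs without a network connection.

```r
# Hypothetical helper (not part of rvest): call reader(url) and
# return NULL instead of stopping when it raises an error.
safe_read <- function(url, reader) {
  tryCatch(
    reader(url),
    error = function(e) {
      message("Skipping ", url, ": ", conditionMessage(e))
      NULL
    }
  )
}

# Stand-in for read_html() so the sketch runs offline: fails on page 15139.
fake_reader <- function(url) {
  if (grepl("15139$", url)) stop("HTTP error 404.")
  paste("content of", url)
}

source_url <- "https://go2senkyo.com/local/senkyo/"
pages <- list()
for (i in 15138:15140) {
  page <- safe_read(paste0(source_url, i), fake_reader)
  if (is.null(page)) next  # skip pages that could not be read
  pages[[as.character(i)]] <- page
}

names(pages)  # the IDs that were read successfully
```

In the real loop you would pass something like function(u) read_html(u, encoding = "UTF-8") as the reader and run the html_nodes() steps on the returned page. Collecting results in a list and binding them once at the end also avoids growing a data frame with rbind() on every iteration.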
Upvotes: 1