Reputation: 139
I get an error when trying to scrape a news website. I checked, and page 32 of the website is broken. I would like to skip the error and keep scraping the rest of the URLs.
I have tried the function tryCatch to avoid the broken link, but since I am quite new to R I do not know how to write the code properly. Should I wrap read_html with that function? If so, how?
url_silla <- 'https://lasillavacia.com/buscar/farc?page=%d'
map_df(0:573, function(i) {
  pagina <- read_html(sprintf(url_silla, i, '%s', '%s', '%s', '%s'))
  print(i)
  data.frame(titles = html_text(html_nodes(pagina, ".col-sm-12 h3")),
             date = html_text(html_nodes(pagina, ".date.col-sm-3")),
             category = html_text(html_nodes(pagina, ".category.col-sm-9")),
             tags = html_text(html_nodes(pagina, ".tags.col-sm-12")),
             link = paste0("https://www.lasillavacia.com", str_trim(html_attr(html_nodes(pagina, "h3 a"), "href"))),
             stringsAsFactors = FALSE)
}) -> noticias_silla
Here is the error. Thanks a lot for any help!
[1] 31
Error in open.connection(x, "rb") : HTTP error 500.
Called from: open.connection(x, "rb")
Upvotes: 2
Views: 2516
Reputation: 3726
You can use purrr::possibly:
url_silla <- 'https://lasillavacia.com/buscar/farc?page=%d'
library(tidyverse)
library(rvest)
# possibly() returns NULL for any page that throws an error,
# and map_df() simply skips those NULLs when binding the rows.
map_df(0:573, possibly(~{
  # The URL template has a single %d, so only the page number is passed.
  pagina <- read_html(sprintf(url_silla, .x))
  print(.x)
  data.frame(titles = html_text(html_nodes(pagina, ".col-sm-12 h3")),
             date = html_text(html_nodes(pagina, ".date.col-sm-3")),
             category = html_text(html_nodes(pagina, ".category.col-sm-9")),
             tags = html_text(html_nodes(pagina, ".tags.col-sm-12")),
             link = paste0("https://www.lasillavacia.com", str_trim(html_attr(html_nodes(pagina, "h3 a"), "href"))),
             stringsAsFactors = FALSE)
}, NULL)) -> noticias_silla
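To see what possibly does without hitting the website, here is a minimal offline sketch. fake_read is a hypothetical stand-in for the scraping step (it is not part of the original code) that fails on page 32 and otherwise returns a one-row data frame.

```r
library(purrr)

# Hypothetical stand-in for the read_html() step: errors on page 32 only.
fake_read <- function(i) {
  if (i == 32) stop("HTTP error 500.")
  data.frame(page = i, stringsAsFactors = FALSE)
}

# Wrap the function so that an error returns NULL instead of stopping the loop.
safe_read <- possibly(fake_read, otherwise = NULL)

# map() keeps one element per page; the failed page is NULL.
results <- map(30:33, safe_read)

# Drop the NULL entries and bind the remaining data frames by row.
combined <- do.call(rbind, compact(results))
```

Here combined should hold the rows for pages 30, 31 and 33, with page 32 silently skipped.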
Upvotes: 0
Reputation: 16832
You can build a tryCatch into a function, then pass that function to map_dfr. Set it to return NULL in the event of an error, which won't break the creation of the data frame by map_dfr.
I'd recommend first trying it with map instead, so you can investigate how some indices return the data frame you want, and some return NULL. In either event, the finally argument will print the index.
library(dplyr)
library(purrr)
library(rvest)
url_silla <- 'https://lasillavacia.com/buscar/farc?page=%d'
read_page <- function(i) {
  tryCatch(
    {
      # The URL template has a single %d, so only the page number is passed.
      pagina <- read_html(sprintf(url_silla, i))
      data.frame(titles = html_text(html_nodes(pagina, ".col-sm-12 h3")),
                 date = html_text(html_nodes(pagina, ".date.col-sm-3")),
                 category = html_text(html_nodes(pagina, ".category.col-sm-9")),
                 tags = html_text(html_nodes(pagina, ".tags.col-sm-12")),
                 link = paste0("https://www.lasillavacia.com", trimws(html_attr(html_nodes(pagina, "h3 a"), "href"))),
                 stringsAsFactors = FALSE)
    },
    # On any error, return NULL so that map_dfr() skips this page.
    error = function(cond) return(NULL),
    # Runs whether or not an error occurred.
    finally = print(i)
  )
}
noticias <- map_dfr(30:33, read_page)
#> [1] 30
#> [1] 31
#> [1] 32
#> [1] 33
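The suggestion to try map first can be sketched offline as follows. fetch_page is a hypothetical replacement for the read_html call (an assumption for illustration only) that fails on page 32; lapply is used here as the base R counterpart of map, so the sketch needs no packages.

```r
# Hypothetical stand-in for read_html(): errors on page 32 only.
fetch_page <- function(i) {
  if (i == 32) stop("HTTP error 500.")
  data.frame(page = i, stringsAsFactors = FALSE)
}

# Same shape as read_page above: NULL on error, index printed either way.
read_page_demo <- function(i) {
  tryCatch(
    fetch_page(i),
    error = function(cond) NULL,
    finally = print(i)
  )
}

pages <- 30:33
res <- lapply(pages, read_page_demo)

# Which pages came back as NULL, i.e. failed?
failed_pages <- pages[vapply(res, is.null, logical(1))]
```

Inspecting failed_pages before switching to map_dfr shows which indices were dropped, which in this sketch is only page 32.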
Upvotes: 1
Reputation: 76402
The code below only processes page numbers 31, 32 and 33.
I am not going to use map_* to solve the problem, since I believe it might make things more difficult than they need to be. I am going to use a standard for loop, as there is no reason not to.
library(rvest)
library(stringr)
library(tidyverse)
url_silla <- 'https://lasillavacia.com/buscar/farc?page=%d'
pages <- 31:33
noticias_silla <- vector("list", length = length(pages))
for(i in pages){
  # The URL template has a single %d, so only the page number is passed.
  p <- sprintf(url_silla, i)
  # On failure, keep the error object instead of stopping the loop.
  pagina <- tryCatch(read_html(p),
                     error = function(e) e)
  print(i)
  if(inherits(pagina, "error")){
    # Record which page failed so it can be inspected or retried later.
    noticias_silla[[i - pages[1] + 1]] <- list(page_num = i, page = p)
  }else{
    noticias_silla[[i - pages[1] + 1]] <- data.frame(titles = html_text(html_nodes(pagina, ".col-sm-12 h3")),
                                                     date = html_text(html_nodes(pagina, ".date.col-sm-3")),
                                                     category = html_text(html_nodes(pagina, ".category.col-sm-9")),
                                                     tags = html_text(html_nodes(pagina, ".tags.col-sm-12")),
                                                     link = paste0("https://www.lasillavacia.com", str_trim(html_attr(html_nodes(pagina, "h3 a"), "href"))),
                                                     stringsAsFactors = FALSE)
  }
}
lapply(noticias_silla, class)
#[[1]]
#[1] "data.frame"
#
#[[2]]
#[1] "list"
#
#[[3]]
#[1] "data.frame"
Note that the second list member is a "list", not a "data.frame". This is the one where the error occurred.
noticias_silla[[2]]
#$page_num
#[1] 32
#
#$page
#[1] "https://lasillavacia.com/buscar/farc?page=32"
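One possible follow-up, sketched here with a small hand-made list shaped like noticias_silla (the titles "a" and "b" are placeholders, not real scraped data), is to split the results into the data frames that succeeded and the page numbers that failed.

```r
# Hypothetical result list with the same structure as noticias_silla above:
# data frames for the pages that worked, a plain list for the one that failed.
noticias_silla <- list(
  data.frame(titles = "a", stringsAsFactors = FALSE),
  list(page_num = 32, page = "https://lasillavacia.com/buscar/farc?page=32"),
  data.frame(titles = "b", stringsAsFactors = FALSE)
)

# TRUE for the elements that are data frames, FALSE for the failures.
ok <- vapply(noticias_silla, is.data.frame, logical(1))

# Bind the successful pages into a single data frame.
noticias <- do.call(rbind, noticias_silla[ok])

# Collect the page numbers that failed, for example to retry them later.
failed <- vapply(noticias_silla[!ok], function(x) x$page_num, numeric(1))
```

Keeping the failed page numbers separate makes it easy to rerun only those pages once the site is fixed.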
Upvotes: 0