phil

Reputation: 191

R Web Scraping: Error handling when web page doesn't contain a table

I'm having some difficulties web scraping. Specifically, I'm scraping web pages that generally have tables embedded. However, for the instances in which there is no embedded table, I can't seem to handle the error in a way that doesn't break the loop.

Example code below:

event = c("UFC 226: Miocic vs. Cormier", "ONE Championship 76: Battle for the Heavens", "Rizin FF 12")
eventLinks = c("https://www.bestfightodds.com/events/ufc-226-miocic-vs-cormier-1447", "https://www.bestfightodds.com/events/one-championship-76-battle-for-the-heavens-1532", "https://www.bestfightodds.com/events/rizin-ff-12-1538")
testLinks = data.frame(event, eventLinks)

for (i in 1:length(testLinks)) {
  print(testLinks$event[i])
  event = tryCatch(as.data.frame(read_html(testLinks$eventLink[i]) %>% html_table(fill=T)),
                   error = function(e) {NA})
}

The second link does not have a table embedded. I thought I'd just skip it with my tryCatch, but instead of skipping it, the link breaks the loop.

What I'm hoping to figure out is a way to skip links with no tables, but continue scraping the next link in the list. To continue using the example above, I want the tryCatch to move from the second link onto the third.

Any help? Much appreciated!

Upvotes: 1

Views: 1123

Answers (1)

stevec

Reputation: 52258

There are a few things to fix here. Firstly, your links are stored as factors (you can see this with testLinks %>% sapply(class)), so you'll need to convert them to character using as.character(). I've done this in the code below.
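To check the column types without touching the network, here is a minimal sketch. The shortened example.com URLs and the names as_factor and as_char are placeholders of mine, not from the question. Note that as of R 4.0.0, data.frame() defaults to stringsAsFactors = FALSE, so the factor issue only appears on older versions or when that argument is set explicitly.

```r
# Hypothetical stand-in for the question's data frame (placeholder URLs).
event <- c("A", "B", "C")
eventLinks <- c("https://example.com/a",
                "https://example.com/b",
                "https://example.com/c")

# stringsAsFactors = TRUE reproduces the default behaviour before R 4.0.0.
as_factor <- data.frame(event, eventLinks, stringsAsFactors = TRUE)
as_char   <- data.frame(event, eventLinks, stringsAsFactors = FALSE)

# Inspect the class of each column.
factor_classes <- sapply(as_factor, class)
char_classes   <- sapply(as_char, class)
print(factor_classes)
print(char_classes)

# as.character() recovers the URL text from a factor column.
first_url <- as.character(as_factor$eventLinks[1])

# length() of a data frame counts columns; nrow() counts rows.
n_cols <- length(as_char)
n_rows <- nrow(as_char)
```

The last two lines are also worth a look: looping over 1:length(df) iterates over the number of columns, which is rarely what is wanted when each row holds one link.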

Secondly, you need to assign each scrape to a list element. Create a list outside the loop with events <- list(), then assign each scrape to an element of that list inside the loop, i.e. events[[i]] <- "something". Without a list, you'll simply overwrite the first scrape with the second, the second with the third, and so on.

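Now your tryCatch will work: when a URL does not contain a table, it assigns NA instead of stopping the loop with an error. Note also that the loop runs over 1:nrow(testLinks) rather than 1:length(testLinks). The length() of a data frame is its number of columns (2 here), so the original loop could never reach the third link.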

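library(rvest)  # provides read_html(), html_table() and %>%

events <- list()  # one element per scraped page
for (i in 1:nrow(testLinks)) {
  print(testLinks$event[i])
  # Convert the (factor) link to character before reading it
  events[[i]] = tryCatch(as.data.frame(read_html(as.character(testLinks$eventLinks[i])) %>% html_table(fill=TRUE)),
                   error = function(e) {NA})  # on failure, store NA and move on
}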

events
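Once the loop has finished, you will probably want to drop the NA placeholders. The sketch below shows the same tryCatch-into-a-list pattern offline and then filters the results. scrape_one() is a hypothetical stand-in for the read_html() and html_table() call, and the URLs are placeholders, so treat this as an illustration of the pattern rather than a drop-in replacement.

```r
# Hypothetical stand-in for the real scraping call: fails for one input.
scrape_one <- function(url) {
  if (grepl("no-table", url)) stop("no table found at ", url)
  data.frame(url = url, rows = 1L)
}

urls <- c("https://example.com/a",
          "https://example.com/no-table",
          "https://example.com/c")

# Pre-allocate one list slot per URL.
results <- vector("list", length(urls))
for (i in seq_along(urls)) {
  results[[i]] <- tryCatch(
    scrape_one(urls[i]),
    error = function(e) {
      # Report the failure, then return NA so the loop carries on.
      message("Skipping ", urls[i], ": ", conditionMessage(e))
      NA
    }
  )
}

# Keep only the elements that are data frames (the successful scrapes).
ok <- vapply(results, is.data.frame, logical(1))
scraped <- results[ok]
```

Logging the message inside the error handler makes it easier to see which links were skipped and why, instead of only finding NA in the output afterwards.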

Upvotes: 3
