Reputation: 191
I'm having some difficulties web scraping. Specifically, I'm scraping web pages that generally have tables embedded. However, for the instances in which there is no embedded table, I can't seem to handle the error in a way that doesn't break the loop.
Example code below:
event = c("UFC 226: Miocic vs. Cormier", "ONE Championship 76: Battle for the Heavens", "Rizin FF 12")
eventLinks = c("https://www.bestfightodds.com/events/ufc-226-miocic-vs-cormier-1447", "https://www.bestfightodds.com/events/one-championship-76-battle-for-the-heavens-1532", "https://www.bestfightodds.com/events/rizin-ff-12-1538")
testLinks = data.frame(event, eventLinks)
for (i in 1:length(testLinks)) {
print(testLinks$event[i])
event = tryCatch(as.data.frame(read_html(testLinks$eventLink[i]) %>% html_table(fill=T)),
error = function(e) {NA})
}
The second link does not have a table embedded. I thought I'd just skip it with my tryCatch, but instead of skipping it, the link breaks the loop.
What I'm hoping to figure out is a way to skip links with no tables, but continue scraping the next link in the list. To continue using the example above, I want the tryCatch to move from the second link onto the third.
Any help? Much appreciated!
Upvotes: 1
Views: 1123
Reputation: 52258
There are a few things to fix here. First, your links are stored as factors (you can see this with testLinks %>% sapply(class)), so you'll need to convert them to character using as.character(). I've done this in the code below.
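To see the factor issue in isolation, here is a minimal sketch. It assumes the data frame was built with stringsAsFactors = TRUE, which was the default before R 4.0.0 (from 4.0.0 on the default is FALSE and the columns would already be character). The `links` data frame and its example.com URLs are made up for illustration.

```r
# Hypothetical data frame; stringsAsFactors = TRUE mimics the pre-4.0.0 default
links <- data.frame(
  event = c("A", "B"),
  url = c("https://example.com/a", "https://example.com/b"),
  stringsAsFactors = TRUE
)

sapply(links, class)   # both columns report "factor"

# as.character() recovers the original string from the factor
url_chr <- as.character(links$url[1])
class(url_chr)         # "character"
```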
Second, you need to assign each scrape to a list element. Create a list outside the loop with events <- list(), then assign each scrape to an element of that list inside the loop, i.e. events[[i]] <- "something". Without a list, you'll simply overwrite the first scrape with the second, the second with the third, and so on.
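The difference between overwriting a variable and filling a list can be sketched with plain numbers. The names `inputs`, `result`, and `results` are made up for this illustration and are not part of the scraping code.

```r
inputs <- c(1, 2, 3)

# Overwriting: only the value from the last iteration survives
result <- NULL
for (i in seq_along(inputs)) {
  result <- inputs[i] * 10
}

# Accumulating: one list element per iteration
results <- list()
for (i in seq_along(inputs)) {
  results[[i]] <- inputs[i] * 10
}

result           # 30
length(results)  # 3
```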
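Third, bound the loop with nrow(testLinks) rather than length(testLinks): length() of a data frame returns the number of columns (2 here), so your original loop stopped after the second link no matter what tryCatch did. With these changes, tryCatch assigns NA when scraping a URL fails (for example, when it does not contain a table), and the loop carries on to the next link instead of stopping with an error.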
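library(rvest)  # provides read_html(), html_table(), and the %>% pipe

events <- list()
for (i in seq_len(nrow(testLinks))) {
  print(testLinks$event[i])
  # Convert the factor to character, scrape the page, and fall back to NA on error
  events[[i]] <- tryCatch(
    as.data.frame(read_html(as.character(testLinks$eventLinks[i])) %>% html_table(fill = TRUE)),
    error = function(e) NA
  )
}
events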
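The same pattern can be tried without network access. In this sketch, `fake_scrape` is a hypothetical stand-in for the read_html()/html_table() call that fails on one input, and `pages` is made-up data. It also prints length() and nrow() for the data frame, since the loop bound depends on which one is used.

```r
# Made-up data: three pages, the second one has no table
pages <- data.frame(
  name = c("first", "second", "third"),
  has_table = c(TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for the real scrape; errors when there is no table
fake_scrape <- function(has_table) {
  if (!has_table) stop("no table found")
  data.frame(x = 1:2)
}

length(pages)  # 2: number of columns
nrow(pages)    # 3: number of rows

scraped <- list()
for (i in seq_len(nrow(pages))) {
  # On error, store NA for this element and keep looping
  scraped[[i]] <- tryCatch(fake_scrape(pages$has_table[i]), error = function(e) NA)
}

# Keep only the entries that were scraped successfully
ok <- Filter(is.data.frame, scraped)
length(ok)     # 2
```

Filtering afterwards is one possible way to drop the NA placeholders; keeping them instead preserves the one-to-one match between list positions and rows of the data frame.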
Upvotes: 3