Ramon Melo

Reputation: 270

read_html() induces fatal error in R session

I'm trying to scrape a set of news articles using rvest and boilerpipeR. The code works fine most of the time, but it crashes on some specific values. I searched high and low online and could not find anyone reporting anything similar.

require(rvest)
require(stringr)
require(boilerpipeR)
require(RCurl)   # getURLContent() comes from RCurl

# This is a problematic URL; duplicates of it also trigger the fatal error
url = "http://viagem.estadao.com.br/noticias/geral,museu-da-mafia-ganha-exposicao-permanente-da-serie-the-breaking-bad,10000018395"

content_html = getURLContent(url)              # HTML source as a character string
article_text = ArticleExtractor(content_html)  # returns NA for this page

# The next line induces the fatal error
encoded_exit = read_html(content_html, encoding = "UTF-8")

paragraph = html_nodes(encoded_exit, "p")
article_text = html_text(paragraph)
article_text = iconv(article_text, from = "UTF-8", to = "latin1")

This is not the only article for which ArticleExtractor() returns NA, and the code was built to handle NA as a valid result. The whole snippet is wrapped in a tryCatch(), so ordinary errors should not be able to stop execution.
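For reference, a minimal sketch of the tryCatch() setup described above (scrape_article() is a hypothetical stand-in for the snippet):

# Hypothetical wrapper: scrape_article() stands in for the snippet above
safe_scrape = function(url) {
  tryCatch(
    scrape_article(url),
    error = function(e) {
      message("Error on ", url, ": ", conditionMessage(e))
      NA_character_  # treat failures as missing data and move on
    }
  )
}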

The main issue is that the entire R session crashes and has to be restarted, which prevents me from capturing any data for debugging.

What could be causing this issue?
And how can I stop it from crashing the entire R session?

Upvotes: 1

Views: 828

Answers (1)

A.Milleroni52653

Reputation: 46

I had the same problem: the R session crashes without any error message (session aborted), no matter whether I use 32-bit or 64-bit R. The solution for me was to look at the page I was scraping. If the page contains severe mistakes in its HTML syntax, the session will crash, and it's reproducible. Check the page with https://validator.w3.org. In your case:

"Error: Start tag body seen but an element of the same type was already open."

From line 107, column 1; to line 107, column 25

crashed it. Your document had two opening <body> tags. A quick-and-dirty solution for me was to check first whether read_html() produced valid HTML content:

library(rvest)

url = "http://www.blah.de"
page = read_html(url, encoding = "UTF-8")

# Check HTML validity first to prevent a fatal crash: the serialized
# document should contain a well-nested <html>/<body> structure
if (!grepl("<html.*<body.*</body>.*</html>", as.character(page), ignore.case = TRUE)) {
  print("Skip this site")
}

# proceed with html_nodes(..) etc
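A more defensive alternative (a sketch, not part of the check above, assuming the callr package is installed) is to run the risky parse in a throwaway R subprocess, so that even a hard crash only kills the child process and the main session survives:

library(callr)

# Run the parse in a separate R process; a segfault there cannot
# take down the main session. Returns NA if anything goes wrong.
# The function body mirrors the pipeline from the question.
parse_safely = function(content_html) {
  tryCatch(
    callr::r(function(html) {
      doc = xml2::read_html(html, encoding = "UTF-8")
      rvest::html_text(rvest::html_nodes(doc, "p"))
    }, args = list(content_html)),
    error = function(e) NA_character_
  )
}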

Upvotes: 1
