Reputation: 270
I'm trying to scrape a set of news articles using rvest and boilerpipeR. The code works fine most of the time, but it crashes for some specific URLs. I searched high and low online and could not find anyone reporting anything similar.
require(rvest)
require(stringr)
require(boilerpipeR)
require(RCurl) # provides getURLContent()
# this is a problematic URL; its duplicates also cause fatal errors
url = "http://viagem.estadao.com.br/noticias/geral,museu-da-mafia-ganha-exposicao-permanente-da-serie-the-breaking-bad,10000018395"
content_html = getURLContent(url) # HTML source code as a character string
article_text = ArticleExtractor(content_html) # returns 'NA' for this page
# the next line crashes the R session
encoded_exit = read_html(content_html, encoding = "UTF-8")
paragraph = html_nodes(encoded_exit, "p")
article_text = html_text(paragraph)
article_text = iconv(article_text, from = "UTF-8", to = "latin1")
This is not the only article for which ArticleExtractor() returns 'NA', and the code was built to handle that as a valid result. The whole snippet is inside a tryCatch(), so regular errors should not be able to stop execution.
The main issue is that the entire R session just crashes and has to be reloaded, which prevents me from grabbing data and debugging it.
What could be causing this issue?
And how can I stop it from crashing the entire R session?
Upvotes: 1
Views: 828
Reputation: 46
I had the same problem. Rscript crashes without any error message (session aborted), whether I use the 32-bit or the 64-bit build. The solution for me was to look at the page I was scraping: if its HTML has severe syntax errors, Rscript crashes, and it's reproducible. Check the page with https://validator.w3.org. In your case, this error:
"Error: Start tag body seen but an element of the same type was already open."
From line 107, column 1; to line 107, column 25
crashed it, so your document has two opening <body> tags. A quick-and-dirty solution for me was to first check whether read_html gets valid HTML content:
url = "http://www.blah.de"
page = read_html(url, encoding = "UTF-8")
# check HTML validity first to prevent a fatal crash
if (!grepl("<html.*<body.*</body>.*</html>", toString(page), ignore.case = TRUE)) {
  print("Skip this site")
}
# proceed with html_nodes(..) etc.
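Since the crash in the question happens inside read_html itself, the same idea can also be applied to the raw HTML text before it is parsed at all. Below is a minimal sketch in base R, assuming the page source is available as a single character string (for example, the content_html value from the question). The helpers count_open_tags() and looks_parseable() are hypothetical names made up for this illustration, not part of rvest, and the "exactly one <html> and one <body>" rule is an assumption that only catches the duplicated-tag problem described above, not every kind of malformed HTML.

```r
# Hypothetical helper (not part of rvest): count opening tags of a given
# name in raw HTML text, without parsing the document.
# Expects `html` to be a single character string.
count_open_tags <- function(html, tag) {
  # Match "<tag" followed by whitespace, ">" or "/", case-insensitively,
  # so "<body>" and "<BODY class='x'>" count, but "</body>" does not.
  pattern <- paste0("<", tag, "(\\s|>|/)")
  hits <- gregexpr(pattern, html, ignore.case = TRUE, perl = TRUE)[[1]]
  # gregexpr() returns -1 when there is no match at all
  if (hits[1] == -1) 0L else length(hits)
}

# Assumption for this sketch: a page is considered safe to parse only if
# it has exactly one opening <html> tag and exactly one opening <body> tag.
looks_parseable <- function(html) {
  count_open_tags(html, "html") == 1L && count_open_tags(html, "body") == 1L
}

good <- "<html><head></head><body><p>ok</p></body></html>"
bad  <- "<html><body><p>a</p><body><p>b</p></body></html>"

looks_parseable(good)  # TRUE
looks_parseable(bad)   # FALSE: two opening <body> tags
```

In the question's code, this check could run on content_html right after getURLContent(url), skipping the URL when looks_parseable(content_html) is FALSE, so that read_html is never called on a page with the duplicated tag.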
Tags: r, rscript, rvest, session-aborted, web-scraping
Upvotes: 1