JonnyRobbie

Reputation: 614

R rvest html scraping

I have an R script that looks something like this:

id <- "25731"
url_name <- "Cross_Ange:_Tenshi_to_Ryuu_no_Rondo"
library(rvest)
html_content <- html(paste("http://myanimelist.net/anime/", id, "/", url_name, "/stats", sep=""))
test_page <- html_node(html_content, "div")

The test_page variable is only there to check whether the page loaded correctly. The problem is that sometimes it doesn't: html_content ends up holding unexpected HTML such as

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html style="height:100%">
<head>
<meta name="ROBOTS" content="NOINDEX, NOFOLLOW">
<meta name="format-detection" content="telephone=no">
<meta name="viewport" content="initial-scale=1.0">
<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">
</head>
<body style="margin:0px;height:100%"><iframe src="/_Incapsula_Resource?CWUDNSAI=9&amp;xinfo=6-12427765-0%200NNN%20RT(1427731440619%201)%20q(0%20-1%20-1%20-1)%20r(0%20-1)%20B12(4,315,0)&amp;incident_id=124002150019133827-71376390758075766&amp;edet=12&amp;cinfo=04000000" frameborder="0" width="100%" height="100%" marginheight="0px" marginwidth="0px">Request unsuccessful. Incapsula incident ID: 124002150019133827-71376390758075766</iframe></body>
</html>

That is not the page I requested, and the wrong content is not even consistent: sometimes a different wrong page comes back.

The URL itself is correct: when I open it in Firefox, the correct HTML is returned on the first try, as expected.

The weird thing is that if I run the line with the html() function several times, it eventually loads the correct HTML page without me changing anything. It is weirdly inconsistent, which is terrible when I try to automate the execution with Rscript.

I've set up a while loop that checks whether the HTML has loaded correctly (whether html_node() finds any div tag). RStudio executes it just fine, but Rscript throws an error:

Error in as.vector(x, "list") :
  cannot coerce type 'environment' to vector of type 'list'
Calls: html_node ... <Anonymous> -> lapply -> as.list -> as.list.default
Execution halted

In summary: in RStudio, html() behaves inconsistently and sometimes returns a weird page, but repeating the call eventually gets the correct one. Under Rscript, the same script fails outright with the error above.

R version 3.1.3 (2015-03-09) -- "Smooth Sidewalk"

Upvotes: 2

Views: 1018

Answers (2)

Carl Boneri

Reputation: 2722

This worked on the first try for me, with the following setup:

version  R version 3.3.1 (2016-06-21)
system   x86_64, linux-gnu           
ui       RStudio (0.99.903)          
language (EN)                        
collate  en_US.UTF-8                 
tz       <NA>                        
date     2016-09-23 


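library(rvest)  # provides html(), html_nodes(), html_table() and %>%

u <- 'http://myanimelist.net/anime/25731/Cross_Ange:_Tenshi_to_Ryuu_no_Rondo/stats'

# html() is deprecated in rvest >= 0.3.0; use read_html(u) there
a <- html(u)

# Parse every table under #content into a list of data frames
b <- html_nodes(a, "#content table") %>% html_table(fill = TRUE)

# In the fourth table, promote the first row to column names, then drop that row
colnames(b[[4]]) <- b[[4]][1,] %>% unlist %>% as.character
b[[4]] <- b[[4]][2:nrow(b[[4]]),]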

> head(b[[4]])
===  ===============  =====  =============  ========  ==============
\    Member           Score  Status         Eps Seen  Activity      
===  ===============  =====  =============  ========  ==============
2    Kirito_Kun36     7      Completed      25 / 25   11 minutes ago
3    StargazerM       -      On-Hold        - / 25    14 minutes ago
4    Frosti_Limbu     -      Plan to Watch            21 minutes ago
5    ShadowGekko      -      Plan to Watch            37 minutes ago
6    meedly           -      Watching       12 / 25   39 minutes ago
7    cowboyninjabear  5      Completed      25 / 25   55 minutes ago
===  ===============  =====  =============  ========  ==============

Upvotes: 0

Bruno Lobo

Reputation: 556

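The website you're trying to load uses a service called Incapsula (www.incapsula.com), which prevents bots from accessing its content. The page you got back, with the /_Incapsula_Resource iframe and the "Request unsuccessful. Incapsula incident ID" message, is its block page rather than the page you asked for.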
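
If you still want to retry from a script, a minimal sketch along the following lines might help. It is not a way around the protection. It only recognises the block page by the two marker strings visible in the output quoted in the question, and waits between attempts. The helper names is_incapsula_block and fetch_with_retry, and the parameters max_tries and wait_secs, are hypothetical and made up for this illustration. fetch_fn is assumed to be any function that returns the page as a single string.

```r
# Returns TRUE when page_text (a single string) looks like the block page
# quoted in the question. The marker strings are taken from that output.
is_incapsula_block <- function(page_text) {
  grepl("_Incapsula_Resource", page_text, fixed = TRUE) ||
    grepl("Incapsula incident ID", page_text, fixed = TRUE)
}

# Hypothetical helper: calls fetch_fn() up to max_tries times, pausing
# wait_secs seconds between attempts. Returns the first result that is not
# a block page, or NULL if every attempt failed or was blocked.
fetch_with_retry <- function(fetch_fn, max_tries = 5, wait_secs = 2) {
  for (attempt in seq_len(max_tries)) {
    # Treat a download error like a failed attempt instead of stopping
    page_text <- tryCatch(fetch_fn(), error = function(e) NA_character_)
    if (!is.na(page_text) && !is_incapsula_block(page_text)) {
      return(page_text)
    }
    if (attempt < max_tries) Sys.sleep(wait_secs)
  }
  NULL
}

# Usage sketch (needs network access, so it is left commented out):
# u <- "http://myanimelist.net/anime/25731/Cross_Ange:_Tenshi_to_Ryuu_no_Rondo/stats"
# page_text <- fetch_with_retry(function() {
#   paste(readLines(u, warn = FALSE), collapse = "\n")
# })
# if (is.null(page_text)) stop("still blocked after all retries")
```

Returning NULL instead of looping forever lets an unattended Rscript run stop with a clear message when the site keeps refusing the request.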

Upvotes: 1
