Dambo

Reputation: 3486

How to access a page scraped using RSelenium with rvest?

I am trying to scrape a webpage that uses AngularJS. My understanding is that the only option in R is to load the page with RSelenium first and then parse the content. However, I find rvest more intuitive than RSelenium for parsing, so I would like to do as little as possible in RSelenium and switch to rvest as soon as I can.

So far, I have realized that I need RSelenium at least to connect to the page and retrieve the HTML, which I parse with htmlTreeParse. Suppose this is part of my output:

structure(list(name = "div", attributes = structure(c("im_dialog_date", 
"dialogMessage.dateText"), .Names = c("class", "ng-bind")), children = structure(list(
    text = structure(list(name = "text", attributes = NULL, children = NULL, 
        namespace = NULL, namespaceDefinitions = NULL, value = "6:52 PM"), .Names = c("name", 
    "attributes", "children", "namespace", "namespaceDefinitions", 
    "value"), class = c("XMLTextNode", "XMLNode", "RXMLAbstractNode", 
    "XMLAbstractNode", "oldClass"))), .Names = "text"), namespace = NULL, 
    namespaceDefinitions = NULL), .Names = c("name", "attributes", 
"children", "namespace", "namespaceDefinitions"), class = c("XMLNode", 
"RXMLAbstractNode", "XMLAbstractNode", "oldClass"))

How can I pass it to rvest::read_html()?

Upvotes: 0

Views: 1464

Answers (1)

alistaire

Reputation: 43334

If you look at the class of your item, it's an XMLNode, a class defined by the XML package. The package provides a toString method for that class (but, curiously, no as.character method), which converts the node to an ordinary string that can in turn be read by xml2::read_html:

library(rvest)
#> Loading required package: xml2

node <- structure(list(name = "div", attributes = structure(c("im_dialog_date", 
"dialogMessage.dateText"), .Names = c("class", "ng-bind")), children = structure(list(
    text = structure(list(name = "text", attributes = NULL, children = NULL, 
        namespace = NULL, namespaceDefinitions = NULL, value = "6:52 PM"), .Names = c("name", 
    "attributes", "children", "namespace", "namespaceDefinitions", 
    "value"), class = c("XMLTextNode", "XMLNode", "RXMLAbstractNode", 
    "XMLAbstractNode", "oldClass"))), .Names = "text"), namespace = NULL, 
    namespaceDefinitions = NULL), .Names = c("name", "attributes", 
"children", "namespace", "namespaceDefinitions"), class = c("XMLNode", 
"RXMLAbstractNode", "XMLAbstractNode", "oldClass"))

node %>% XML::toString.XMLNode() %>% read_html()
#> {xml_document}
#> <html>
#> [1] <body><div class="im_dialog_date" ng-bind="dialogMessage.dateText">6 ...

That said, I normally use the getPageSource() method of RSelenium's remoteDriver to grab all of the HTML, which is then easily parsed with rvest.
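
A minimal sketch of that getPageSource() route, under some assumptions: `page_from_driver` is a hypothetical helper name, and `remDr` is assumed to be an RSelenium remoteDriver that is already open and has navigated to the page. Since that part needs a running Selenium server, the parsing step below uses a literal string built from the div in the question as a stand-in for the rendered source.

```r
library(rvest)

# Hypothetical helper: `remDr` is assumed to be an already-open
# RSelenium remoteDriver that has navigated to the target page.
page_from_driver <- function(remDr) {
  # getPageSource() returns a list; its first element is the HTML string
  read_html(remDr$getPageSource()[[1]])
}

# Stand-in for the rendered page source, so the rvest part runs without a browser
html <- '<div class="im_dialog_date" ng-bind="dialogMessage.dateText">6:52 PM</div>'
page <- read_html(html)

# From here on it is ordinary rvest: select by CSS class and pull out the text
date_text <- page %>%
  html_nodes("div.im_dialog_date") %>%
  html_text()
```

With a live session, `page <- page_from_driver(remDr)` would replace the stand-in, and everything after it stays the same, so RSelenium is only needed for loading the page.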

Upvotes: 3
