RobertLD

Reputation: 77

PhantomJS returns 404 in R when attempting web scraping

I am trying to collect some data from OTC Markets (within the confines of their robots.txt), but I cannot connect to the page.

  1. The first step I tried was just to scrape the HTML right off the page, but the page requires JavaScript to load.
  2. So I downloaded PhantomJS and connected that way. However, this leads to a 404 error page.
  3. I then changed the user-agent to something resembling a real browser to see if it would let me connect, and still no luck! What is going on here?

Here is a reproducible version of my code; any help would be appreciated. PhantomJS can be downloaded here: http://phantomjs.org/

library(rvest)
library(xml2)
library(V8)

# example website, I have no correlation to this stock
url <- 'https://www.otcmarkets.com/stock/YTROF/profile'

# create a javascript file that phantomjs can process
writeLines(sprintf("var page = require('webpage').create();
page.settings.userAgent = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.120 Safari/537.36';
page.open('%s', function () {
    console.log(page.content); //page source
    phantom.exit();
});", url), con = "scrape.js")

# replace phantomjs.exe_PATH with the path to the phantomjs executable
html <- system("phantomjs.exe_PATH scrape.js", intern = TRUE)
page_html <- read_html(paste(html, collapse = "\n"))

Upvotes: 0

Views: 126

Answers (1)

Emmanuel Hamel

Reputation: 2213

I was able to get the HTML content with the following code, which is based on Selenium rather than PhantomJS:

library(RSelenium)

# start a Selenium server in Docker, exposed on local port 4445
shell('docker run -d -p 4445:4444 selenium/standalone-firefox')

# connect to the server and open the page
remDr <- remoteDriver(remoteServerAddr = "localhost", port = 4445L, browserName = "firefox")
remDr$open()
remDr$navigate('https://www.otcmarkets.com/stock/YTROF/profile')

# scroll down so lazily loaded content gets rendered
remDr$executeScript("scroll(0, 5000)")
remDr$executeScript("scroll(0, 10000)")
remDr$executeScript("scroll(0, 15000)")
Sys.sleep(4)

# check visually that the page has loaded, then grab its source
remDr$screenshot(display = TRUE, useViewer = TRUE)
html_Content <- remDr$getPageSource()[[1]]

It is important to give the page time to load before extracting the HTML content.
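If a fixed Sys.sleep() turns out to be unreliable, a small polling helper can wait for a specific element before grabbing the source. This is only a sketch: the CSS selector ".security-details" is a hypothetical placeholder, and you would substitute an element that actually appears on the profile page.

# Minimal sketch: poll until a selector matches instead of sleeping a fixed time.
# ".security-details" is a hypothetical placeholder selector.
wait_for_element <- function(remDr, selector, timeout = 15) {
  start <- Sys.time()
  repeat {
    found <- remDr$findElements(using = "css selector", value = selector)
    if (length(found) > 0) return(invisible(TRUE))
    if (difftime(Sys.time(), start, units = "secs") > timeout) {
      stop("Timed out waiting for: ", selector)
    }
    Sys.sleep(0.5)
  }
}

wait_for_element(remDr, ".security-details")
html_Content <- remDr$getPageSource()[[1]]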

Here is another approach based on RDCOMClient:

library(RDCOMClient)

url <- 'https://www.otcmarkets.com/stock/YTROF/profile'

# start a visible Internet Explorer instance via COM and open the page
IEApp <- COMCreate("InternetExplorer.Application")
IEApp[['Visible']] <- TRUE
IEApp$Navigate(url)

# give the page time to load, then grab the document
Sys.sleep(5)
doc <- IEApp$Document()

# extract the rendered text of the page
Sys.sleep(5)
html_Content <- doc$documentElement()$innerText()
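Once you have the page source from the Selenium approach (note that the RDCOMClient snippet above returns plain text rather than HTML), you can hand it back to rvest as in the original question. The "table" selector below is just an illustrative assumption; the actual markup of the profile page would need to be inspected.

library(rvest)

# parse the source string returned by remDr$getPageSource()[[1]]
page <- read_html(html_Content)

# hypothetical example: pull all tables from the profile page
tables <- page %>% html_elements("table") %>% html_table()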

Upvotes: 1
