Reputation: 77
I am trying to collect some data from OTC Markets (within the confines of their robots.txt), but I cannot connect to the webpage.
Here is a reproducible version of my code; any help would be appreciated. PhantomJS can be downloaded here: http://phantomjs.org/
library(rvest)
library(xml2)
library(V8)
# example website, I have no correlation to this stock
url <- 'https://www.otcmarkets.com/stock/YTROF/profile'
# create javascript file that phantomjs can process
writeLines(sprintf("var page = require('webpage').create();
page.settings.userAgent = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.120 Safari/537.36';
page.open('%s', function () {
console.log(page.content); //page source
phantom.exit();
});", url), con="scrape.js")
# replace phantomjs.exe_PATH with the path to your PhantomJS executable
html <- system("phantomjs.exe_PATH scrape.js", intern = TRUE)
# intern = TRUE returns a character vector of lines, so collapse it
# into a single string before parsing
page_html <- read_html(paste(html, collapse = "\n"))
Upvotes: 0
Views: 126
Reputation: 2213
I was able to get the HTML content with the following code, which is based not on PhantomJS but on Selenium:
library(RSelenium)

# start a Selenium server with Firefox in a Docker container
shell('docker run -d -p 4445:4444 selenium/standalone-firefox')

remDr <- remoteDriver(remoteServerAddr = "localhost", port = 4445L, browserName = "firefox")
remDr$open()
remDr$navigate('https://www.otcmarkets.com/stock/YTROF/profile')

# scroll down in stages so that lazily loaded content gets rendered
remDr$executeScript("scroll(0, 5000)")
remDr$executeScript("scroll(0, 10000)")
remDr$executeScript("scroll(0, 15000)")
Sys.sleep(4)

# optional: check visually that the page has finished loading
remDr$screenshot(display = TRUE, useViewer = TRUE)

html_Content <- remDr$getPageSource()[[1]]
It is important to give the page time to load before extracting the HTML content.
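Instead of a fixed `Sys.sleep`, one option is to poll until the page reports that it is ready. Here is a minimal, generic polling helper as a sketch; the `document.readyState` predicate shown in the comment is an assumption about how you might use it with RSelenium, not something from the code above:

```r
# Generic polling helper: call `ready` repeatedly until it returns TRUE
# or the timeout (in seconds) elapses. With RSelenium you might pass
# something like
#   function() remDr$executeScript("return document.readyState;")[[1]] == "complete"
# (that predicate is illustrative, not taken from the answer above).
wait_until <- function(ready, timeout = 30, interval = 0.5) {
  deadline <- Sys.time() + timeout
  while (Sys.time() < deadline) {
    if (isTRUE(ready())) return(TRUE)
    Sys.sleep(interval)
  }
  FALSE  # timed out
}

# trivial usage: a predicate that becomes TRUE on its third call
n <- 0
wait_until(function() { n <<- n + 1; n >= 3 }, timeout = 5, interval = 0.1)
# TRUE
```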
Here is another approach, based on RDCOMClient (Windows only, since it drives Internet Explorer via COM):
library(RDCOMClient)

url <- 'https://www.otcmarkets.com/stock/YTROF/profile'

# create an Internet Explorer COM instance and load the page
IEApp <- COMCreate("InternetExplorer.Application")
IEApp[['Visible']] <- TRUE
IEApp$Navigate(url)
Sys.sleep(5)

doc <- IEApp$Document()
Sys.sleep(5)
html_Content <- doc$documentElement()$innerText()
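Once `html_Content` has been retrieved by either method, it can be parsed with rvest as usual. A minimal, self-contained sketch; the HTML snippet and the `.symbol` CSS selector are invented for illustration and are not taken from the OTC Markets page:

```r
library(rvest)

# stand-in for the page source returned by getPageSource();
# this structure is made up for the example
html_Content <- '<html><body>
  <span class="symbol">YTROF</span>
  <div class="company-name">Example Co</div>
</body></html>'

page <- read_html(html_Content)
symbol <- page %>% html_element(".symbol") %>% html_text()
symbol
# "YTROF"
```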
Upvotes: 1