Reputation: 1014
I'm trying to use R to download some files from a web page whose content is rendered by JavaScript, and I'm finding it confusing.
The files are listed in a table. My plan is to retrieve the page, scrape the table, identify the URLs and download the files. This question is about the first step: retrieving the page.
After some searching I found a solution using PhantomJS, which looks promising. I'm not proficient in JS: I can follow the code, but I have little idea how to make it work in my scenario.
My current script is:
// scrape_super_data_science_ml_data.js
var webPage = require('webpage');
var page = webPage.create();
var fs = require('fs');
var path = 'super_data_science_ml_data.html'
page.open('https://www.superdatascience.com/pages/machine-learning', function (status) {
var content = page.content;
fs.write(path,content,'w')
phantom.exit();
});
After the call, the page is downloaded but without the JS-rendered content. I don't know whether the content just needs more time to render before the page is saved, or whether something else is wrong.
Here's an example of my process in R:
# Scrape page
system("phantomjs scrape_super_data_science_ml_data.js")
# Check results
library(rvest)
library(dplyr)
page <- read_html("super_data_science_ml_data.hmtl")
page %>% html_text()
Can someone help me? Any tips would be very much appreciated!
Upvotes: 0
Views: 64
Reputation: 645
I'm not sure if this is the exact code you're using, but there are some errors in the code you posted. For the phantomjs code I use
var system = require('system');
var page = require('webpage').create();
page.open('https://www.superdatascience.com/pages/machine-learning', function()
{
console.log(page.content);
phantom.exit();
});
I then call the code in R with
# Scrape page
system("phantomjs scrape_super_data_science_ml_data.js > super_data_science_ml_data.html")
# Check results
library(rvest)
library(dplyr)
page <- read_html("super_data_science_ml_data.html")
page %>% html_text()
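The first error is that your system() call does not save the HTML: the script above prints the page to stdout, so its output has to be redirected to a file.
The second is a spelling error in the file name passed to read_html(): "super_data_science_ml_data.hmtl" instead of "super_data_science_ml_data.html".
Regarding your question about rendering: one of the main reasons to use PhantomJS rather than rvest alone is that PhantomJS is a headless browser and executes the page's JavaScript, whereas rvest is a simpler scraper that only parses the HTML it is given.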
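If the saved HTML still lacks the rendered table, one thing worth trying is to delay the capture, since the page.open callback fires when the page has loaded and content fetched afterwards may not be in the DOM yet. Below is a minimal sketch of that idea, not a tested solution for this particular site. The file name wait_then_dump.js, the waitFor helper, the 10-second timeout and the "at least one table element exists" check are all assumptions made for illustration.

```javascript
// wait_then_dump.js (hypothetical file name)

// Poll isReady() every intervalMs milliseconds. Call done(true) as soon as it
// returns true, or done(false) once timeoutMs milliseconds have elapsed.
function waitFor(isReady, done, timeoutMs, intervalMs) {
  var start = Date.now();
  var timer = setInterval(function () {
    var ok = Boolean(isReady());
    if (ok || Date.now() - start >= timeoutMs) {
      clearInterval(timer);
      done(ok);
    }
  }, intervalMs);
}

// Run the scraping part only under PhantomJS, where the global `phantom` exists.
if (typeof phantom !== 'undefined') {
  var page = require('webpage').create();
  page.open('https://www.superdatascience.com/pages/machine-learning', function (status) {
    if (status !== 'success') {
      phantom.exit(1); // the page could not be loaded
      return;
    }
    waitFor(
      function () {
        // Assumption: the rendered content contains at least one <table>.
        return page.evaluate(function () {
          return document.querySelectorAll('table').length > 0;
        });
      },
      function (found) {
        // Print whatever has been rendered so far; exit code 2 signals a timeout.
        console.log(page.content);
        phantom.exit(found ? 0 : 2);
      },
      10000, // give up after 10 seconds (arbitrary choice)
      250    // check every 250 ms
    );
  });
}
```

It can be called from R the same way as before, redirecting stdout to the HTML file. If the table is loaded inside an iframe or under a different element, the check inside page.evaluate would need to be adapted.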
Upvotes: 1