Scrape JS-rendered content with PhantomJS

Question

I'm trying to use R to download some files from a web page whose content is rendered by JavaScript, and I'm stuck.

The files are listed in a table. My plan is to retrieve the page, scrape the table, identify the URLs, and download the files. This question is about the first step: retrieving the page.

After some searching I found a solution using PhantomJS, which looks promising. I'm not proficient in JS, so I can follow the code but have little idea how to adapt it to my scenario.

My current script is:

// scrape_super_data_science_ml_data.js

var webPage = require('webpage');
var page = webPage.create();

var fs = require('fs');
var path = 'super_data_science_ml_data.html';

page.open('https://www.superdatascience.com/pages/machine-learning', function (status) {
  var content = page.content;
  fs.write(path, content, 'w');
  phantom.exit();
});

After running it, the page is saved, but without the JS-rendered content. I don't know whether the script needs to give the page time to render before saving it, or whether something else is wrong.

Here's an example of my process in R:

# Scrape page
system("phantomjs scrape_super_data_science_ml_data.js")

# Check results
library(rvest)
library(dplyr)
page <- read_html("super_data_science_ml_data.html")
page %>% html_text()

Can someone help me? Any tip would be very appreciated!
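
One possible cause is the timing suspected above: the callback passed to page.open fires once the initial document has loaded, which can be before the page's own scripts have finished building the table. A minimal sketch of a workaround, under stated assumptions, is to poll the page until the table links exist (or a timeout passes) and only then write page.content. The waitFor helper is hypothetical (it is not part of PhantomJS), and the 'table a' selector, the 10 second timeout, and the 250 ms polling interval are guesses that would need adjusting to the real page.

```javascript
// scrape_super_data_science_ml_data.js (sketch, not a definitive implementation)

// Hypothetical helper: call isReady() every intervalMs milliseconds until it
// returns true or timeoutMs has elapsed, then call onDone(true or false).
function waitFor(isReady, onDone, timeoutMs, intervalMs) {
  var start = new Date().getTime();
  var timer = setInterval(function () {
    var ready = Boolean(isReady());
    var elapsed = new Date().getTime() - start;
    if (ready || elapsed >= timeoutMs) {
      clearInterval(timer);
      onDone(ready);
    }
  }, intervalMs);
}

// Only run the scraping part inside PhantomJS, where `phantom` is defined.
if (typeof phantom !== 'undefined') {
  var page = require('webpage').create();
  var fs = require('fs');
  var path = 'super_data_science_ml_data.html';

  page.open('https://www.superdatascience.com/pages/machine-learning', function (status) {
    if (status !== 'success') {
      console.log('Failed to load the page');
      phantom.exit(1);
      return;
    }
    waitFor(function () {
      // Runs inside the page. ASSUMPTION: the rendered table contains links.
      return page.evaluate(function () {
        return document.querySelectorAll('table a').length > 0;
      });
    }, function (ready) {
      if (!ready) {
        console.log('Timed out waiting for the table; saving the page as-is');
      }
      fs.write(path, page.content, 'w');
      phantom.exit(ready ? 0 : 1);
    }, 10000, 250); // ASSUMPTION: 10 s timeout, 250 ms polling interval
  });
}
```

The R side would stay the same: call the script with system() and read the saved HTML file with read_html(). If the table never appears even with a long timeout, the content may be loaded in a way this check does not detect, and the selector is the first thing to revisit.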
