Bruno Pinheiro

Reputation: 1014

Scrape JS-rendered content with phantomjs

I'm trying to use R to download some files from a webpage whose content is rendered by JavaScript, and I'm finding it confusing.

The files are in a table. My idea is to read and retrieve the page, scrape the table, identify URLs and download files. This is the first step: read and retrieve the page.

After some searching I found a solution using phantomjs, which looks very nice to me. I'm not proficient in JS: I can follow the code, but I have little idea how to make it work in my scenario.

My current script is:

// scrape_super_data_science_ml_data.js

var webPage = require('webpage');
var page = webPage.create();

var fs = require('fs');
var path = 'super_data_science_ml_data.html';

page.open('https://www.superdatascience.com/pages/machine-learning', function (status) {
  var content = page.content;
  fs.write(path, content, 'w');
  phantom.exit();
});

After the call, the page is downloaded, but without the JS-rendered content. I don't know whether the page needs more time to render before it is saved, or whether something else is wrong.

Here's an example of my process in R:

# Scrape page
system("phantomjs scrape_super_data_science_ml_data.js")

# Check results
library(rvest)
library(dplyr)
page <- read_html("super_data_science_ml_data.hmtl")
page %>%  html_text()

Can someone help me? Any tip will be very much appreciated!

Upvotes: 0

Views: 64

Answers (1)

piie

Reputation: 645

I'm not sure if this is the exact code you're using, but there are some errors in the code you posted. For the phantomjs script I use:

var page = require('webpage').create();

// Load the page, print its HTML to stdout, then quit.
page.open('https://www.superdatascience.com/pages/machine-learning', function () {
    console.log(page.content);
    phantom.exit();
});

I then call the code in R with

# Scrape page
system("phantomjs scrape_super_data_science_ml_data.js > super_data_science_ml_data.html")

# Check results
library(rvest)
library(dplyr)
page <- read_html("super_data_science_ml_data.html")
page %>%  html_text()

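The first error is that you forgot to have R save the HTML from the system() call; here the > redirect writes phantomjs's output to the file. The second was a spelling error in the file name passed to read_html(): "super_data_science_ml_data.hmtl" instead of "super_data_science_ml_data.html".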

As for your question about rendering: phantomjs is a headless browser, so it executes the page's JavaScript, whereas rvest's read_html() only parses the HTML it is given.
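
If the table still turns out to be missing, it may be that the page injects it after phantomjs fires the open callback. In that case, one option is to poll the page until the element appears before saving. The following is only a rough sketch under that assumption: the waitFor helper, the 'table' selector, and the 10 second / 250 ms timings are all illustrative guesses, not something taken from the site or from the phantomjs documentation. The phantomjs-specific part is guarded so that it only runs under phantomjs.

```javascript
// Sketch only: poll until a condition holds (or a timeout passes),
// then run a callback. waitFor, the 'table' selector and the timings
// below are illustrative assumptions.
function waitFor(check, onDone, timeoutMs, pollMs) {
  var start = Date.now();
  var timer = setInterval(function () {
    var ready = check();
    var timedOut = Date.now() - start >= timeoutMs;
    if (ready || timedOut) {
      clearInterval(timer);
      onDone(ready); // true if the condition was met, false on timeout
    }
  }, pollMs);
}

// This part only runs under phantomjs, where the global `phantom` exists.
if (typeof phantom !== 'undefined') {
  var page = require('webpage').create();
  var fs = require('fs');
  var path = 'super_data_science_ml_data.html';

  page.open('https://www.superdatascience.com/pages/machine-learning', function (status) {
    if (status !== 'success') {
      console.log('Failed to load the page');
      phantom.exit(1);
      return;
    }
    waitFor(
      function () {
        // Evaluated inside the page: true once a <table> element exists.
        return page.evaluate(function () {
          return document.querySelector('table') !== null;
        });
      },
      function (ready) {
        if (!ready) {
          console.log('Timed out waiting for the table; saving the page as-is');
        }
        fs.write(path, page.content, 'w');
        phantom.exit();
      },
      10000, // give up after 10 seconds (assumed value)
      250    // check every 250 ms (assumed value)
    );
  });
}
```

Because this version writes the file itself with fs.write, it would be called from R with system("phantomjs scrape_super_data_science_ml_data.js"), without the > redirect.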

Upvotes: 1
