Scraping a JavaScript-rendered webpage that references external JavaScript files in R

Question

I am trying to scrape this webpage: https://www.mustardbet.com/sports/events/302698

Since the webpage seems to be rendered dynamically, I am following this tutorial: https://www.datacamp.com/community/tutorials/scraping-javascript-generated-data-with-r#gs.dZEqev8

As the tutorial suggests, I save a file named "scrape_mustard.js" with the following code:

// scrape_mustard.js

var webPage = require('webpage');
var page = webPage.create();

var fs = require('fs');
var path = 'mustard.html';

page.open('https://www.mustardbet.com/sports/events/302698', function (status) {
  var content = page.content;
  fs.write(path, content, 'w');
  phantom.exit();
});

Then I run

system("./phantomjs scrape_mustard.js")

but I get the error:

ReferenceError: Can't find variable: Set

  https://www.mustardbet.com/assets/js/index.dfd873fb.js:1
  https://www.mustardbet.com/assets/js/index.dfd873fb.js:1 in t
  https://www.mustardbet.com/assets/js/index.dfd873fb.js:1
  https://www.mustardbet.com/assets/js/index.dfd873fb.js:1 in t
  https://www.mustardbet.com/assets/js/index.dfd873fb.js:1
  https://www.mustardbet.com/assets/js/index.dfd873fb.js:1 in t
  https://www.mustardbet.com/assets/js/index.dfd873fb.js:1
  https://www.mustardbet.com/assets/js/index.dfd873fb.js:1 in t
  https://www.mustardbet.com/assets/js/index.dfd873fb.js:1

When I paste "https://www.mustardbet.com/assets/js/index.dfd873fb.js" into my browser, I can see that it is JavaScript, so I probably need to either (1) save it as a file, or (2) include it in scrape_mustard.js.

But for (1), I don't know how to reference that new file, and for (2), I don't know how to define all that JavaScript properly so that it can be used.

I'm a complete newbie to JavaScript, but maybe this problem is not too difficult?

Thanks for your help!

JdeMello · Accepted Answer

I was able to scrape the page using the Node.js module puppeteer.

Download and install Node.js from nodejs.org. Node.js comes with npm, which makes installing modules much easier. You need to install puppeteer using npm.

In RStudio, make sure you are in your working directory when you install puppeteer, so that the module ends up where the script can find it. Once Node.js is installed, run:

system("npm i puppeteer")

scrape_mustard.js:

// load modules
const fs = require("fs");
const puppeteer = require("puppeteer");

// page url
const url = "https://www.mustardbet.com/sports/events/302698";

const scrape = async () => {
    const browser = await puppeteer.launch({headless: false}); // open browser
    const page = await browser.newPage(); // open new page
    // go to page; timeout: 0 disables the navigation timeout
    await page.goto(url, {waitUntil: "networkidle2", timeout: 0});
    // give the javascript-rendered content time to load; a plain timer is used
    // because page.waitFor() is deprecated in newer puppeteer releases
    await new Promise((resolve) => setTimeout(resolve, 5000));
    const html = await page.content(); // copy page contents
    await browser.close(); // close chromium
    return html; // return the html string
};

scrape()
    .then((value) => {
        // write the string returned by scrape()
        fs.writeFileSync("./stackoverflow/page.html", value);
    })
    .catch((err) => {
        // report failures instead of leaving an unhandled rejection
        console.error(err);
        process.exit(1);
    });
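
One detail worth noting: fs.writeFileSync does not create missing directories, so the write above fails if ./stackoverflow does not exist yet. Below is a minimal sketch of a helper that creates the parent directory first. The name savePage is hypothetical (it is not part of the original script), and the recursive option of fs.mkdirSync assumes Node.js 10.12 or later.

```javascript
const fs = require("fs");
const path = require("path");

// Hypothetical helper: write `html` to `outPath`, creating any missing
// parent directories first. Returns the path that was written.
function savePage(outPath, html) {
    // recursive: true makes this a no-op when the directory already exists
    fs.mkdirSync(path.dirname(outPath), {recursive: true});
    fs.writeFileSync(outPath, html);
    return outPath;
}
```

With this in place, the final step of the script could call savePage("./stackoverflow/page.html", value) instead of fs.writeFileSync directly.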

To run scrape_mustard.js in R:

library(magrittr)

system("node ./stackoverflow/scrape_mustard.js")

html <- xml2::read_html("./stackoverflow/page.html")

oddsMajor <- html %>% 
  rvest::html_nodes(".odds-major")

betNames <- html %>% 
  rvest::html_nodes("h3")

Console output:

> oddsMajor
{xml_nodeset (60)}
 [1] 2
 [2] 14
 [3] 15
 [4] 16
 [5] 17
 [6] 23
 [7] 25
 [8] 32
 [9] 33
[10] 39
[11] 47
[12] 54
[13] 55
[14] 58
[15] 58
[16] 64
[17] 73
[18] 73
[19] 92
[20] 98
...
> betNames
{xml_nodeset (60)}
 [1] Charles Howell III
 [2] Brian Harman
 [3] Austin Cook
 [4] J.J. Spaun
 [5] Webb Simpson
 [6] Cameron Champ
 [7] Peter Uihlein
 [8] Seung-Jae Im
 [9] Nick Watney
[10] Graeme McDowell
[11] Zach Johnson
[12] Lucas Glover
[13] Corey Conners
[14] Luke List
[15] David Hearn
[16] Adam Schenk
[17] Kevin Kisner
[18] Brian Gay
[19] Patton Kizzire
[20] Brice Garnett
...

I am sure it can be done with PhantomJS, but I have found puppeteer easier for scraping JavaScript-rendered webpages. Also keep in mind that PhantomJS is no longer being developed.
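
For context on the original error: Set is an ES2015 built-in. The ReferenceError shows that the JavaScript engine bundled with PhantomJS does not provide it, so the site's bundle fails before it can render anything. Saving or including index.dfd873fb.js yourself would not change that. If you wanted to stay with PhantomJS, a fallback would have to be defined inside the page before the site's scripts run. The sketch below only illustrates the idea: MiniSet and installSetFallback are hypothetical names, and the code covers only a small part of the real Set API.

```javascript
// Hypothetical, deliberately minimal stand-in for ES2015 Set, in ES5 syntax.
// Supports add/has/delete/forEach/size only and compares with indexOf,
// so NaN is not handled the way a real Set handles it.
function MiniSet(values) {
  this._items = [];
  if (values) {
    for (var i = 0; i < values.length; i++) {
      this.add(values[i]);
    }
  }
}

MiniSet.prototype.has = function (value) {
  return this._items.indexOf(value) !== -1;
};

MiniSet.prototype.add = function (value) {
  if (!this.has(value)) {
    this._items.push(value);
  }
  return this; // allow chaining, like the real Set
};

MiniSet.prototype["delete"] = function (value) {
  var index = this._items.indexOf(value);
  if (index === -1) {
    return false;
  }
  this._items.splice(index, 1);
  return true;
};

MiniSet.prototype.forEach = function (callback, thisArg) {
  for (var i = 0; i < this._items.length; i++) {
    callback.call(thisArg, this._items[i], this._items[i], this);
  }
};

Object.defineProperty(MiniSet.prototype, "size", {
  get: function () {
    return this._items.length;
  }
});

// Install the stand-in on a global object only when no Set exists there.
function installSetFallback(globalObject) {
  if (typeof globalObject.Set === "undefined") {
    globalObject.Set = MiniSet;
  }
  return globalObject.Set;
}
```

In PhantomJS, something like installSetFallback(window) would have to be evaluated in the page context before the bundle executes, for example from the page.onInitialized callback. A maintained polyfill would be a more realistic choice than this sketch, and the bundle may well depend on other ES2015 features too, which is part of why puppeteer is the easier route.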
