cranberry

Reputation: 55

R Web Scraping with rvest and V8

I am trying to use R to scrape the various tables on https://www.rotowire.com/football/player.php?id=4307, but because the page builds them with JavaScript I have hit a few snags. I have installed the rvest and V8 libraries and tried to find the right nodes, but I am fairly sure I am not specifying the table nodes correctly. I checked with the website owners and they are OK with people scraping their data.

The V8 webpage includes a snippet of example code to scrape email addresses. I tried to modify that code to suit my purposes.

#Loading both the required libraries
library(rvest)
library(V8)

link <- 'https://www.rotowire.com/football/player.php?id=4307'
emailjs <- read_html(link) %>% html_nodes('div') %>% html_nodes('basicStats') %>% html_text()

ct <- v8()
read_html(ct$eval(gsub('document.write','',emailjs))) %>% 
  html_text()

This did not work.

I have also tried:

emailjs <- read_html(link) %>% html_nodes('div') %>% html_nodes('script') %>% html_text()
ct <- v8()
read_html(ct$eval(gsub('document.write','',emailjs))) %>% 
   html_text()

As well as:

emailjs <- read_html(link) %>% html_nodes('div') %>% html_nodes('basicStats') %>% html_text()

I think the first chunk of code fails because I am specifying the node incorrectly.

The second chunk pulls back everything, but it gives the error below:

Error in context_eval(join(src), private$context) : 
  ReferenceError: window is not defined

If you look at the page source, the table starts on line 289 with:

<div id="basicStats" class="">

The HTML:

            <div class="p-page__middle-box">

<div id="basicStats-header" class="p-page__section-head is-stats">NFL Stats</div>
<div id="basicStats">
    <div class="table-load"><div class="table-load__inner"><div class="loader"></div>Loading NFL Stats...</div></div>    </div>
    <script async>
document.addEventListener('rw:pp-data-available', function(e){
    var defaultData = { 'basic': { 'body': [], 'footer': [] }};
    var data = (e.detail) ? e.detail : defaultData;
    var tableID = "basicStats";
    var playerID = "4307";
    var primaryStatCat = "Pass";

    var stats = {
    'pass': [
        { id: 'passComp', startOfGroup: true, header: [{ text: 'Passing', colspan: 6, }, 'COMP'], },
        { id: 'passAtt', header: ['', 'ATT'], },
        { id: 'passPct', header: ['', 'PCT'], },
        { id: 'passYds', header: ['', 'YDS'], },
        { id: 'passTD', header: ['', 'TD'], },
        { id: 'passInt', header: ['', 'INT'], },
    ],

Upvotes: 3

Views: 2679

Answers (1)

QHarr

Reputation: 84465

It is available if you use the same endpoint the page does to update content. It returns JSON with all the info.

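library(httr)

# Request the same endpoint the page calls to load its stats
r <- GET("https://www.rotowire.com/football/ajax/player-page-data.php?id=4307&pos=QB&team=GB&opp=")
# Parse the JSON response body into an R list
json <- content(r, as = "parsed")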

Do what you want with the JSON. To explore it, paste the URL into Firefox, which displays JSON in a built-in viewer.
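As a rough sketch of one way to get from the JSON to a table: the page script in the question defaults to an object with a `basic` entry holding `body` and `footer` arrays, so the example below assumes the response has that shape. The payload here is a made-up stand-in, not real output from the endpoint; the `season` field and all values are placeholders, and only the `passComp` and `passAtt` ids come from the page script.

```r
library(jsonlite)

# Hypothetical payload shaped like the page's `defaultData` object
# ('basic' -> 'body' / 'footer'). Field values are placeholders.
sample_json <- '{
  "basic": {
    "body": [
      {"season": "A", "passComp": 10, "passAtt": 20},
      {"season": "B", "passComp": 30, "passAtt": 40}
    ],
    "footer": []
  }
}'

# fromJSON() simplifies an array of objects into a data frame by default
parsed <- fromJSON(sample_json)
basic_stats <- parsed$basic$body

str(basic_stats)
```

With the real response, the same idea would be `fromJSON(content(r, as = "text", encoding = "UTF-8"))`, then inspecting the result with `str()` to see which elements actually hold the rows.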


You can find that URL in the Network tab of the browser's developer tools.
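As for the `ReferenceError: window is not defined` in the question: V8 is a bare JavaScript engine, not a browser, so page scripts that expect `window` or `document` cannot run in it. A minimal check, assuming a default context created with `v8()`:

```r
library(V8)

# A default context has no browser objects
ct <- v8()

has_window <- ct$eval("typeof window")      # type of `window` inside the context
has_document <- ct$eval("typeof document")  # type of `document` inside the context

print(c(window = has_window, document = has_document))
```

Even with those objects stubbed out, the script on this page only fills the table inside a listener for the `rw:pp-data-available` event, which fires after the data has been fetched, so requesting the data endpoint directly is the simpler route.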


Upvotes: 5
