hossibley

Reputation: 253

How can I scrape this data?

I want to scrape the statistics from this page:

url <- "http://www.pgatour.com/players/player.20098.stuart-appleby.html/statistics"

Specifically, I want to grab the data in the table that's underneath Stuart's headshot. It's headlined by "Stuart Appleby - 2015 STATS PGA TOUR"

I tried using rvest in combination with SelectorGadget (http://selectorgadget.com/).

library(rvest)
url_html <- url %>% html()
url_html %>% 
        html_nodes(xpath = '//*[(@id = "playerStats")]//td')

This 'should' get me the table without, for example, the row on top that says "Recap -- Rank -- Additional Stats".

url_html <- url %>% html()
url_html %>% 
    html_nodes(xpath = '//*[(@id = "playerStats")] | //th//*[(@id = "playerStats")]//td') 

This 'should' get me the table including that "Recap -- Rank -- Additional Stats" row.

Neither does.

Obviously I'm a complete newbie when it comes to web scraping. When I click 'view source' for that page, the data shown in the table isn't there.

In the source code, where I think the table should be starting, is this bit of code:

<script id="playerStatsTourTemplate" type="text/x-jquery-tmpl">
    {{each(t, tour) tours}}
        {{if pgatour.players.shouldProcessTour(tour.tourCodeLC)}}
        <div class="statistics-head">
            <h2 class="title">Stuart&nbsp;Appleby - <b>${year} STATS 
.
.
.

So it appears the table is stored somewhere (JSON? jQuery? JavaScript? Are those terms applicable here?) that isn't accessible to the html() function. Is there any way to use rvest to grab this data? Is there an rvest equivalent for grabbing data that is stored in this manner?

Thanks.

Upvotes: 1

Views: 3938

Answers (2)

koxon

Reputation: 898

Check this out.

Open source project on GitHub scraping PGA data: https://github.com/zachwill/golf/blob/master/pga.py

Upvotes: 1

cory

Reputation: 6659

I'd probably use the GET request that the page itself makes to pull the raw data from their API, and then work on parsing that.

In the code below, content(a) gives you a list representation (basically the output of fromJSON()), while as(a, "character") gives you the raw JSON.

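library("httr")
# Request the JSON file that the statistics page loads in the background
a <- GET("http://www.pgatour.com/data/players/20098/2014stat.json")
content(a)          # parsed response as nested R lists
as(a, "character")  # raw JSON as a single string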
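
For the parsing step, here is a rough sketch of one way to go from the raw JSON text to an R structure you can explore, using jsonlite alongside httr. The helper names parse_stats and fetch_json_text are made up for illustration, and the sample_json payload is entirely hypothetical: I have not checked what fields the real file contains, so inspect it with str() before relying on any names.

```r
library(httr)
library(jsonlite)

# Parse a JSON string into nested R lists. simplifyVector = FALSE keeps
# the result as plain lists, so it mirrors the JSON structure exactly.
parse_stats <- function(json_text) {
  fromJSON(json_text, simplifyVector = FALSE)
}

# Download a JSON endpoint and return the response body as text.
fetch_json_text <- function(url) {
  resp <- GET(url)
  stop_for_status(resp)  # stop with an error on HTTP failures
  content(resp, as = "text", encoding = "UTF-8")
}

# Hypothetical payload, used only to show the parsing step offline.
# The real file's field names are an unknown here.
sample_json <- '{"player": "example", "stats": [{"name": "Scoring Average", "value": "70.1"}]}'
parsed <- parse_stats(sample_json)
str(parsed, max.level = 2)

# Against the real endpoint (needs network access; structure unverified):
# parsed <- parse_stats(fetch_json_text(
#   "http://www.pgatour.com/data/players/20098/2014stat.json"))
# str(parsed, max.level = 2)
```

Once str() shows where the statistics sit in the nested list, you can pull out just those elements and bind them into a data frame.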

Upvotes: 2
