Canovice

Reputation: 10491

rvest not able to grab html table using html_nodes("table"), despite table being on page

We are struggling to grab the main table at this fangraphs link. Using rvest:

library(rvest)

url = 'https://www.fangraphs.com/leaders/splits-leaderboards?splitArr=1&splitArrPitch=&position=B&autoPt=false&splitTeams=false&statType=team&statgroup=2&startDate=2021-07-07&endDate=2021-07-21&players=&filter=&groupBy=season&sort=9,1'
table_nodes = url %>% read_html() %>% html_nodes('table')
table_nodes

 table_nodes
{xml_nodeset (7)}
[1] <table class="menu-standings-table"><tbody><tr>\n<td>\r\n                                            <div class="menu-sub-header">AL East</div>\r\n                       ...
[2] <table class="menu-team-table">\n<tr>\n<td>\r\n                                        <div class="menu-sub-header">AL East</div>\r\n                                     ...
[3] <table class="menu-team-table">\n<tr>\n<td>\r\n                                        <div class="menu-sub-header">AL East</div>\r\n                                     ...
[4] <table>\n<tr>\n<td><a href="http://www.fangraphs.com/blogs/top-45-prospects-baltimore-orioles">BAL</a></td>\n<td><a href="http://www.fangraphs.com/blogs/top-34-prospects ...
[5] <table>\n<tr>\n<td><a href="http://www.fangraphs.com/blogs/top-30-prospects-atlanta-braves">ATL</a></td>\n<td><a href="http://www.fangraphs.com/blogs/top-49-prospects-ch ...
[6] <table>\n<tr>\n<td><a href="http://www.fangraphs.com/blogs/top-40-prospects-baltimore-orioles">BAL</a></td>\n<td><a href="http://www.fangraphs.com/blogs/top-38-prospects ...
[7] <table>\n<tr>\n<td><a href="http://www.fangraphs.com/blogs/top-27-prospects-atlanta-braves">ATL</a></td>\n<td><a href="http://www.fangraphs.com/blogs/top-41-prospects-ch ...

None of these 7 tables is the main table at the URL, the one with all of the different team stats. The main table sits inside a wrapper div, div.table-scroll, but url %>% read_html() %>% html_nodes('div.table-scroll') returns an empty nodeset.

Edit: I found the network request, but I am still not sure how to get the API call from it. How can I see the full API call for this?

Upvotes: 1

Views: 162

Answers (1)

QHarr

Reputation: 84475

The data is retrieved dynamically from an API call, so it is not in the HTML that read_html() receives. Switch to httr, because you need to make a POST request whose body includes the start and end dates. Also, switch to infinite, so that you get back as much data as possible in as few calls as possible.

You will probably want to wrap the code below in a custom function that accepts the start and end dates as arguments.

library(httr)
library(purrr)

headers = c(
  'user-agent' = 'Mozilla/5.0',
  'content-type' = 'application/json;charset=UTF-8'
)

data = '{"strPlayerId":"all","strSplitArr":[1],"strGroup":"season","strPosition":"B","strType":"2","strStartDate":"2021-07-07","strEndDate":"2021-07-21","strSplitTeams":false,"dctFilters":[],"strStatType":"team","strAutoPt":"false","arrPlayerId":[],"strSplitArrPitch":[]}'

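# POST the JSON body to the API and parse the response
r <- httr::POST(url = 'https://www.fangraphs.com/api/leaders/splits/splits-leaders', httr::add_headers(.headers = headers), body = data) %>% content()

# Bind the records in r$data into a single data frame
df <- map_df(r$data, data.frame)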
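
As a rough sketch of the custom function mentioned above, the request could be wrapped along these lines. The names build_splits_body and get_team_splits are hypothetical, not part of any package, and every field other than the two dates is copied unchanged from the hard-coded body. It assumes httr and purrr are installed, and that the endpoint keeps accepting this body and returning its records under data.

```r
# Hypothetical helper: build the JSON request body for a given date range.
# Only the two dates vary; everything else matches the hard-coded body above.
build_splits_body <- function(start_date, end_date) {
  # as.Date() raises an error for input it cannot parse as a date
  start_date <- format(as.Date(start_date), "%Y-%m-%d")
  end_date <- format(as.Date(end_date), "%Y-%m-%d")
  sprintf(
    '{"strPlayerId":"all","strSplitArr":[1],"strGroup":"season","strPosition":"B","strType":"2","strStartDate":"%s","strEndDate":"%s","strSplitTeams":false,"dctFilters":[],"strStatType":"team","strAutoPt":"false","arrPlayerId":[],"strSplitArrPitch":[]}',
    start_date, end_date
  )
}

# Hypothetical wrapper: POST the body and return the records as a data frame.
get_team_splits <- function(start_date, end_date) {
  headers <- c(
    'user-agent' = 'Mozilla/5.0',
    'content-type' = 'application/json;charset=UTF-8'
  )
  resp <- httr::POST(
    url = 'https://www.fangraphs.com/api/leaders/splits/splits-leaders',
    httr::add_headers(.headers = headers),
    body = build_splits_body(start_date, end_date)
  )
  # Stop with an error on an HTTP 4xx/5xx response instead of parsing it
  httr::stop_for_status(resp)
  purrr::map_df(httr::content(resp)$data, data.frame)
}
```

Usage would then look like df <- get_team_splits("2021-07-07", "2021-07-21"). Keeping the body construction in its own function means the date handling can be checked without making a network call.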

Upvotes: 3
