CoolGuyHasChillDay

Reputation: 747

rvest: html_table() only picks up header row. Table has 0 rows

I'm learning how to web scrape with rvest and I'm running into some issues. Specifically, the code is only picking up the header row.

library(rvest)
library(XML)

URL1 <- "https://swishanalytics.com/optimus/nba/daily-fantasy-salary-changes?date=2017-11-25"
df <- URL1 %>% read_html() %>% html_node("#stat-table") %>% html_table()

Calling df results in a data.frame with 7 columns and 0 rows. I installed SelectorGadget, and even that tells me that the selector #stat-table is correct. What is unique about this website that keeps rvest from picking up the table data?

As a separate question: if I "View Page Source", I can see all the data on the page, so I wouldn't have to use RSelenium to flip through the DK, FD, or Yahoo salaries. It looks like there are keys that would be easy to find (e.g. find "FD" > find all "player name:" and pick up the characters after, etc.), but I don't know of a library/process that handles the page source. Are there any resources for this?

Thanks.

Upvotes: 1

Views: 1173

Answers (1)

hrbrmstr

Reputation: 78792

You could -- in theory -- extract the data from the <script> tag and then process it with V8 but this is also pretty easy to do with splashr or seleniumPipes. I wrote splashr so I'll show that:

library(splashr)
library(rvest)

start_splash()

pg <- render_html(url="https://swishanalytics.com/optimus/nba/daily-fantasy-salary-changes?date=2017-11-25")

html_node(pg, "table#stat-table") %>% 
  html_table() %>% 
  tibble::as_tibble() 
## # A tibble: 256 x 7
##    Position          Player Salary         Change `Proj Fantasy Pts` `Avg Fantasy Pts`   Diff
##       <chr>           <chr>  <chr>          <chr>              <dbl>             <chr>  <chr>
##  1       PF      Thon Maker $3,900  +$600 (18.2%)              12.88             13.24  -0.36
##  2       PG DeAndre Liggins $3,500  +$500 (16.7%)               9.68              7.80  +1.88
##  3       PG   Elfrid Payton $6,400  +$700 (12.3%)              32.77             28.63  +4.14
##  4        C   Jahlil Okafor $3,000 -$400 (-11.8%)               1.71             12.63 -10.92
##  5       PF    John Collins $5,200   +$400 (8.3%)              29.65             24.03  +5.63
##  6       SG     Buddy Hield $4,600  -$400 (-8.0%)              17.96             21.84  -3.88
##  7       SF    Aaron Gordon $7,000   +$500 (7.7%)              32.49             36.91  -4.42
##  8       PG    Kemba Walker $7,600  -$600 (-7.3%)              36.27             38.29  -2.02
##  9       PG    Lou Williams $6,600  -$500 (-7.0%)              34.28             30.09  +4.19
## 10       PG       Raul Neto $3,200   +$200 (6.7%)               6.81             10.57  -3.76
## # ... with 246 more rows

killall_splash()

BeautifulSoup won't read this data either. Well, you can target the <script> tag that has it in JS form and use a similar V8 engine binding in Python as well, but it's not going to do this any more easily than rvest.

Further expansion on ^^:

Most scraping guides tell you to do "Inspect Element" to eventually find the XPath or CSS selector to target. Inspecting on a random row of that table shows:

[Screenshot: Inspect Element view of a row in that table]

For "normal" sites, that generally works.

Sites that build their content with JavaScript (via XHR requests or on-page JS + data) will look like ^^ in the inspector, but your targeting won't work because read_html() (and the BeautifulSoup equivalent) can't execute the JavaScript on a page without the help of a rendering engine. You can tell whether this is happening by comparing a View Source with the element inspection: if the rows show up in the inspector but not as HTML in the source, JavaScript is building them. Here's the View Source for that site cropped to the very long lines of data + JS + HTML that eventually make the table:

[Screenshot: View Source of the page, cropped to the long lines of data + JS + HTML]
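
That View Source comparison can also be done from R. Here's a minimal sketch of the idea, assuming a made-up stand-in for the un-rendered source (the static_html string and the rows variable inside it are hypothetical, not the site's real markup):

library(rvest)

# Hypothetical stand-in for the un-rendered page source: the table has only a
# header, and the row data sits in a <script> block as JavaScript.
static_html <- paste0(
  '<html><body>',
  '<table id="stat-table"><thead><tr><th>Position</th><th>Player</th></tr></thead>',
  '<tbody></tbody></table>',
  '<script>var rows = [{"position": "PF", "player": "Thon Maker"}];</script>',
  '</body></html>'
)

pg <- read_html(static_html)

# Count the body rows that exist before any JavaScript has run.
n_static_rows <- length(html_nodes(pg, "#stat-table tbody tr"))

# Check whether the player name is present somewhere in the raw source.
name_in_source <- grepl("Thon Maker", static_html, fixed = TRUE)

Zero rows in the parsed table combined with the data being present in the raw text is the sign that a rendering engine (or the <script> + V8 route) is needed.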

I've posted numerous SO answers for how to target those <script> tags and use V8. Using splashr or decapitated is just easier (if they're installed and working).
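
For reference, the <script> + V8 route looks roughly like this. It's only a sketch under assumptions: the inline HTML and the players variable are invented for illustration, and on the real page you'd have to find the right <script> tag and the actual variable name yourself:

library(rvest)
library(V8)

# Hypothetical page fragment: the data lives in a JS variable, not in HTML rows.
html <- paste0(
  '<html><body><script>',
  'var players = [{"name": "Thon Maker", "salary": 3900},',
  ' {"name": "Raul Neto", "salary": 3200}];',
  '</script></body></html>'
)

# Pull the JavaScript source out of the <script> tag.
js <- read_html(html) %>%
  html_node("script") %>%
  html_text()

# Evaluate it in a V8 context and bring the variable back into R.
ctx <- v8()
ctx$eval(js)
players <- ctx$get("players")

V8 converts an array of objects into a data.frame on the way back, so players can be used like any other table from there.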

If you don't want to deal with Docker and you have a recent version of Chrome, you can also follow the guidance here to get headless mode working and do:

res <- system2("chrome", c("--headless", "--dump-dom", "https://swishanalytics.com/optimus/nba/daily-fantasy-salary-changes?date=2017-11-25"), stdout=TRUE)

res then holds the rendered HTML (as a character vector, one element per line of output) that you can read in with rvest and scrape away.
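
Since system2(..., stdout=TRUE) returns a character vector of lines, it needs to be collapsed into one string before parsing. A minimal sketch, using a small hand-written res as a stand-in for the real Chrome output (the two rows are copied from the table above; everything else about the markup is assumed):

library(rvest)

# Stand-in for what system2() would return: one element per line of HTML.
res <- c(
  "<html><body>",
  "<table id='stat-table'>",
  "<thead><tr><th>Position</th><th>Player</th></tr></thead>",
  "<tbody>",
  "<tr><td>PF</td><td>Thon Maker</td></tr>",
  "<tr><td>PG</td><td>Raul Neto</td></tr>",
  "</tbody></table>",
  "</body></html>"
)

# Collapse the lines into a single string and parse it.
pg <- read_html(paste(res, collapse = "\n"))

# Same targeting as before, now against rendered HTML.
df <- html_node(pg, "table#stat-table") %>%
  html_table()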

A package-in-development, decapitated, makes ^^ a bit less ugly:

devtools::install_github("hrbrmstr/decapitated")
library(decapitated)
library(rvest)

chrome_version()
## Google Chrome 63.0.3239.59 beta

pg <- chrome_read_html("https://swishanalytics.com/optimus/nba/daily-fantasy-salary-changes?date=2017-11-25")

html_node(pg, "table#stat-table") %>% 
  html_table() %>% 
  tibble::as_tibble() 
## # A tibble: 256 x 7
##    Position          Player Salary         Change `Proj Fantasy Pts` `Avg Fantasy Pts`   Diff
##       <chr>           <chr>  <chr>          <chr>              <dbl>             <chr>  <chr>
##  1       PF      Thon Maker $3,900  +$600 (18.2%)              12.88             13.24  -0.36
##  2       PG DeAndre Liggins $3,500  +$500 (16.7%)               9.68              7.80  +1.88
##  3       PG   Elfrid Payton $6,400  +$700 (12.3%)              32.77             28.63  +4.14
##  4        C   Jahlil Okafor $3,000 -$400 (-11.8%)               1.71             12.63 -10.92
##  5       PF    John Collins $5,200   +$400 (8.3%)              29.65             24.03  +5.63
##  6       SG     Buddy Hield $4,600  -$400 (-8.0%)              17.96             21.84  -3.88
##  7       SF    Aaron Gordon $7,000   +$500 (7.7%)              32.49             36.91  -4.42
##  8       PG    Kemba Walker $7,600  -$600 (-7.3%)              36.27             38.29  -2.02
##  9       PG    Lou Williams $6,600  -$500 (-7.0%)              34.28             30.09  +4.19
## 10       PG       Raul Neto $3,200   +$200 (6.7%)               6.81             10.57  -3.76
## # ... with 246 more rows

NOTE: Headless Chrome is having issues on High Sierra due to the new permissions and sandboxing. It works on older macOS systems and Windows/Linux. You just need the right version and the right environment variable set.

Upvotes: 5
