Reputation: 747
I'm learning how to webscrape with rvest and I'm running into some issues. Specifically, the code is only picking up the header row.
library(rvest)
library(XML)
URL1 <- "https://swishanalytics.com/optimus/nba/daily-fantasy-salary-changes?date=2017-11-25"
df <- URL1 %>% read_html() %>% html_node("#stat-table") %>% html_table()
Calling df results in a data.frame with 7 columns and 0 rows. I installed inspector gadget, and even that is telling me that the selector #stat-table is correct. What is unique about this website that keeps the table data from being picked up?
As a separate question, if I "View Page Source", I can see all the data on the page and I wouldn't have to use RSelenium to flip through DK, FD, or yahoo salaries. It looks like there are keys that would be easy to find (e.g. find "FD" > find all "player name:" and pick up the characters after, etc.), but I don't know of a library/process that handles the page source. Are there any resources for this?
Thanks.
Upvotes: 1
Views: 1173
Reputation: 78792
You could, in theory, extract the data from the <script> tag and then process it with V8, but this is also pretty easy to do with splashr or seleniumPipes. I wrote splashr, so I'll show that:
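library(splashr)
library(rvest)
# start a local Splash instance (this uses Docker)
start_splash()
# let Splash render the page, JavaScript included, and hand back the resulting HTML
pg <- render_html(url="https://swishanalytics.com/optimus/nba/daily-fantasy-salary-changes?date=2017-11-25")
html_node(pg, "table#stat-table") %>%
  html_table() %>%
  tibble::as_tibble()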
## # A tibble: 256 x 7
## Position Player Salary Change `Proj Fantasy Pts` `Avg Fantasy Pts` Diff
## <chr> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 PF Thon Maker $3,900 +$600 (18.2%) 12.88 13.24 -0.36
## 2 PG DeAndre Liggins $3,500 +$500 (16.7%) 9.68 7.80 +1.88
## 3 PG Elfrid Payton $6,400 +$700 (12.3%) 32.77 28.63 +4.14
## 4 C Jahlil Okafor $3,000 -$400 (-11.8%) 1.71 12.63 -10.92
## 5 PF John Collins $5,200 +$400 (8.3%) 29.65 24.03 +5.63
## 6 SG Buddy Hield $4,600 -$400 (-8.0%) 17.96 21.84 -3.88
## 7 SF Aaron Gordon $7,000 +$500 (7.7%) 32.49 36.91 -4.42
## 8 PG Kemba Walker $7,600 -$600 (-7.3%) 36.27 38.29 -2.02
## 9 PG Lou Williams $6,600 -$500 (-7.0%) 34.28 30.09 +4.19
## 10 PG Raul Neto $3,200 +$200 (6.7%) 6.81 10.57 -3.76
## # ... with 246 more rows
killall_splash()
BeautifulSoup won't read this data either. Well, you can target the <script> tag that has it in JS form and use a similar V8 engine in Python as well, but it's not going to be able to do this any easier than rvest.
Further expansion on ^^:
Most scraping guides tell you to use "Inspect Element" to find the XPath or CSS selector to target. Inspecting a random row of that table shows:
For "normal" sites, that generally works.
Sites that build their content from JS-issued XHR requests (or on-page JS + data) will look like ^^ in the inspector, but your targeting won't work because read_html() (and the BeautifulSoup equivalent) can't execute JavaScript without the help of some rendering engine. The inspector shows the DOM after the JavaScript has run, while read_html() only sees the raw HTML the server sent. You can check whether this is happening by doing a View Source alongside the element inspection. Here's the View Source for that site cropped to the very long lines of data + JS + HTML that eventually make the table:
I've posted numerous SO answers for how to target those <script> tags and use V8. Using splashr or decapitated is just easier (if they're installed and working).
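For completeness, here is a minimal sketch of the <script> + V8 route. It does not use the real page: the inline HTML and the players variable name are made-up stand-ins, so you'd need to look at the actual View Source to find the right <script> tag and variable name.

```r
library(rvest)
library(V8)

# Hypothetical stand-in for a page that ships its table data inside a <script> tag.
# The variable name `players` and its fields are assumptions, not the real site's.
html <- '<html><body>
<table id="stat-table"><thead><tr><th>Player</th><th>Salary</th></tr></thead></table>
<script>var players = [{"player":"Thon Maker","salary":"$3,900"},{"player":"Elfrid Payton","salary":"$6,400"}];</script>
</body></html>'

pg <- read_html(html)

# Pull the raw JavaScript source out of the <script> tag
js <- html_text(html_node(pg, "script"))

# Evaluate it in a V8 context, then copy the JS variable back into R.
# An array of objects comes back as a data.frame.
ctx <- v8()
ctx$eval(js)
players <- ctx$get("players")
players
```

On a real page there are usually many <script> tags, so you'd typically use html_nodes() and pick out the one whose text contains the variable you're after.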
If you don't want to deal with Docker and use a recent version of Chrome, you can also follow the guidance here to get headless working and do:
res <- system2("chrome", c("--headless", "--dump-dom", "https://swishanalytics.com/optimus/nba/daily-fantasy-salary-changes?date=2017-11-25"), stdout=TRUE)
res then holds the rendered page as plain HTML (a character vector with one element per line of output) that you can collapse, read in with rvest, and scrape away.
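A rough sketch of that last step, using a small hand-written res as a stand-in for the real Chrome output (the table contents here are abbreviated and only for illustration):

```r
library(rvest)

# Stand-in for what system2(..., stdout = TRUE) returns:
# a character vector with one element per line of the dumped DOM
res <- c(
  "<html><body>",
  "<table id=\"stat-table\">",
  "<tr><th>Position</th><th>Player</th></tr>",
  "<tr><td>PF</td><td>Thon Maker</td></tr>",
  "</table>",
  "</body></html>"
)

# Collapse the lines into a single string so read_html() parses it as markup
pg <- read_html(paste(res, collapse = "\n"))

tbl <- html_table(html_node(pg, "table#stat-table"))
tbl
```

Collapsing first matters because read_html() expects a single string (or a path/URL), not a vector of lines.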
A package-in-development, decapitated, makes ^^ a bit less ugly:
devtools::install_github("hrbrmstr/decapitated")
library(decapitated)
library(rvest)
chrome_version()
## Google Chrome 63.0.3239.59 beta
pg <- chrome_read_html("https://swishanalytics.com/optimus/nba/daily-fantasy-salary-changes?date=2017-11-25")
html_node(pg, "table#stat-table") %>%
html_table() %>%
tibble::as_tibble()
## # A tibble: 256 x 7
## Position Player Salary Change `Proj Fantasy Pts` `Avg Fantasy Pts` Diff
## <chr> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 PF Thon Maker $3,900 +$600 (18.2%) 12.88 13.24 -0.36
## 2 PG DeAndre Liggins $3,500 +$500 (16.7%) 9.68 7.80 +1.88
## 3 PG Elfrid Payton $6,400 +$700 (12.3%) 32.77 28.63 +4.14
## 4 C Jahlil Okafor $3,000 -$400 (-11.8%) 1.71 12.63 -10.92
## 5 PF John Collins $5,200 +$400 (8.3%) 29.65 24.03 +5.63
## 6 SG Buddy Hield $4,600 -$400 (-8.0%) 17.96 21.84 -3.88
## 7 SF Aaron Gordon $7,000 +$500 (7.7%) 32.49 36.91 -4.42
## 8 PG Kemba Walker $7,600 -$600 (-7.3%) 36.27 38.29 -2.02
## 9 PG Lou Williams $6,600 -$500 (-7.0%) 34.28 30.09 +4.19
## 10 PG Raul Neto $3,200 +$200 (6.7%) 6.81 10.57 -3.76
## # ... with 246 more rows
NOTE: Headless Chrome is having issues on High Sierra due to the new permissions and sandboxing. It works on older macOS systems and Windows/Linux. You just need the right version and the right environment variable set.
Upvotes: 5