Reputation: 4482
I would like to scrape this website.
In particular, I would like to extract the information in this table:
Please note that I chose a specific date in the upper right corner.
Following this guide,
I wrote the following code:
library(rvest)
url_nba <- 'https://projects.fivethirtyeight.com/2017-nba-predictions/'
webpage_nba <- read_html(url_nba)
# Use a CSS selector to scrape the standings table
data_nba <- html_nodes(webpage_nba,'#standings-table')
# Convert the standings data to text
data_nba <- html_text(data_nba)
write.csv(data_nba,"web scraping test.csv")
As far as I understand, the numbers that I want (e.g. for the Warriors: 94%, 79%, 66%, 59%) are "coded" in a different way. In other words, what is written to web scraping test.csv
is not readable.
Is there any way to transform the "coded numbers" into "regular numbers"?
Upvotes: 3
Views: 643
Reputation: 4482
Thanks to @Alexey's answer and this, the following code worked for me:
library(RSelenium)
library(rvest)
library(wdman)
url_nba <- 'https://projects.fivethirtyeight.com/2017-nba-predictions/'
#initiate RSelenium. If it doesn't work, try other browser engines
# rD <- rsDriver()
# remDr <- rD$client
pDrv <- phantomjs(port = 4567L)
remDr <- remoteDriver(browserName = "phantomjs", port = 4567L)
remDr$open()
#navigate to main page
remDr$navigate(url_nba)
#find the box and click option 10 (April 14 before playoffs)
webElem <- remDr$findElement(using = 'xpath', value = "//*[@id='forecast-selector']/div[2]/select/option[10]")
webElem$clickElement()
# Save html
webpage <- remDr$getPageSource()[[1]]
# Close RSelenium
remDr$close()
pDrv$stop()
# rD[["server"]]$stop()
# Select one of the tables and get it to dataframe
webpage_nba <- read_html(webpage) %>% html_table(fill = TRUE)
df <- webpage_nba[[3]]
# Clear the dataframe
names(df) <- df[3,]
df <- tail(df,-3)
df <- head(df,-4)
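# Drop columns without a header; unlike -which(), this keeps all columns when nothing matches
df <- df[ , !(names(df) %in% c("NA", NA)), drop = FALSE]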
df
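The clean-up steps above can be sketched as a small function. This is only an illustration: clean_table and the toy data frame raw are hypothetical names, and the layout (header text in row 3, trailing footer rows, a spacer column with no header) is an assumption modelled on the code above rather than something checked against the live page.

```r
# Toy stand-in for webpage_nba[[3]]: header text sits in row 3 and a spacer
# column has no header. The real table's layout is assumed to be similar.
raw <- data.frame(
  X1 = c("", "", "ELO", "1770", "1661", "note"),
  X2 = c("", "", NA, "", "", ""),
  X3 = c("", "", "Team", "Warriors", "Spurs", ""),
  stringsAsFactors = FALSE
)

# Hypothetical helper: promote one row to the header, drop the rows above it
# and the footer rows, then drop columns whose header is missing.
clean_table <- function(df, header_row = 3, footer_rows = 4) {
  hdr <- as.character(unlist(df[header_row, ]))
  idx <- seq_len(nrow(df))
  body <- df[idx > header_row & idx <= nrow(df) - footer_rows, , drop = FALSE]
  names(body) <- hdr
  # Logical indexing keeps every column when no header is missing,
  # whereas df[, -which(...)] would drop all columns in that case.
  body[, !is.na(hdr) & hdr != "NA", drop = FALSE]
}

res <- clean_table(raw, header_row = 3, footer_rows = 1)
res
```

With the defaults (header_row = 3, footer_rows = 4) the function mirrors the tail()/head() calls above.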
Upvotes: 1
Reputation: 630
I tried to parse the data using rvest
, but the challenging part here is clicking the dropdown menu, which is a <select>
tag in the HTML. So I brought in the heavy artillery: RSelenium,
which automates a real browser. With it everything became easy, thanks to the answer on SO:
library(RSelenium)
library(rvest)
url_nba <- 'https://projects.fivethirtyeight.com/2017-nba-predictions/'
#initiate RSelenium. If it doesn't work, try other browser engines
rD <- rsDriver(port=4444L,browser="firefox")
remDr <- rD$client
#navigate to main page
remDr$navigate(url_nba)
#find the box and click option 10 (April 14 before playoffs)
webElem <- remDr$findElement(using = 'xpath', value = "//*[@id='forecast-selector']/div[2]/select/option[10]")
webElem$clickElement()
# Save html
webpage <- remDr$getPageSource()[[1]]
# Close RSelenium
remDr$close()
rD[["server"]]$stop()
# Select one of the tables and get it to dataframe
webpage_nba <- read_html(webpage) %>% html_table(fill = TRUE)
df <- webpage_nba[[3]]
# Clear the dataframe
names(df) <- df[3,]
df <- tail(df,-3)
df <- head(df,-4)
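# Drop columns without a header; unlike -which(), this keeps all columns when nothing matches
df <- df[ , !(names(df) %in% c("NA", NA)), drop = FALSE]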
df
ELO Carm-ELO 1-Week Change Team Conf. Conf. Semis Conf. Finals Finals Win Title
4 1770 1792 -14 Warriors West 94% 79% 66% 59%
5 1661 1660 -43 Spurs West 90% 62% 15% 11%
6 1600 1603 +33 Raptors East 77% 47% 25% 5%
7 1636 1640 +33 Clippers West 58% 11% 7% 5%
8 1587 1589 -22 Celtics East 70% 42% 24% 4%
9 1587 1584 -9 Wizards East 79% 38% 21% 4%
10 1617 1609 +16 Jazz West 42% 7% 5% 3%
11 1602 1606 -18 Rockets West 70% 27% 5% 3%
12 1545 1541 -22 Cavaliers East 59% 27% 11% 2%
13 1519 1523 +25 Bulls East 30% 15% 7% <1%
14 1526 1520 +37 Pacers East 41% 17% 6% <1%
15 1563 1564 +6 Trail Blazers West 6% 3% 1% <1%
16 1543 1537 -20 Thunder West 30% 8% <1% <1%
17 1502 1502 -3 Bucks East 23% 9% 3% <1%
18 1479 1469 +46 Hawks East 21% 6% 2% <1%
19 1482 1480 -41 Grizzlies West 10% 3% <1% <1%
20 1569 1555 +32 Heat East — — — —
21 1552 1533 +27 Nuggets West — — — —
22 1482 1489 -12 Pelicans West — — — —
23 1463 1472 -18 Timberwolves West — — — —
24 1463 1462 -40 Hornets East — — — —
25 1441 1436 +22 Pistons East — — — —
26 1420 1421 -20 Mavericks West — — — —
27 1393 1395 -2 Kings West — — — —
28 1374 1379 -13 Knicks East — — — —
29 1367 1370 +47 Lakers West — — — —
30 1372 1370 -14 Nets East — — — —
31 1352 1355 -9 Magic East — — — —
32 1338 1348 -29 76ers East — — — —
33 1340 1337 +26 Suns West — — — —
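The forecast columns in the output above are character strings ("94%", "<1%", or a dash for eliminated teams), so they need one more step before they can be used as numbers. A minimal sketch in base R follows; pct_to_numeric is a hypothetical helper, and mapping "<1%" to 0.5 and the dash to NA are my assumptions, not something the site specifies.

```r
# Hypothetical helper: convert strings such as "94%", "<1%" or a dash to numbers.
# Assumptions: "<1%" becomes 0.5, and anything that is not a plain percentage
# (for example the dash shown for eliminated teams) becomes NA.
pct_to_numeric <- function(x) {
  x <- trimws(x)
  out <- rep(NA_real_, length(x))
  below_one <- x %in% "<1%"            # %in% returns FALSE for NA input
  is_pct <- grepl("^[0-9]+%$", x)      # plain percentages like "94%"
  out[below_one] <- 0.5
  out[is_pct] <- as.numeric(sub("%$", "", x[is_pct]))
  out
}

pct_to_numeric(c("94%", "<1%", "\u2014"))

# Possible use on the scraped table (column positions are an assumption):
# df[6:9] <- lapply(df[6:9], pct_to_numeric)
```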
If you want to parse other dates, look up the position of the corresponding <option> inside the <select> in the page's HTML using your browser's Dev Tools, and change the index in option[10] accordingly.
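To avoid editing the XPath string by hand for each date, the index can be passed in as a parameter. A small sketch, where option_xpath is a hypothetical helper and the XPath template is copied from the code above:

```r
# Hypothetical helper: build the XPath for the n-th <option> of the date selector.
option_xpath <- function(n) {
  stopifnot(is.numeric(n), length(n) == 1, n >= 1, n == round(n))
  sprintf("//*[@id='forecast-selector']/div[2]/select/option[%d]", as.integer(n))
}

option_xpath(10)

# Possible use with the session from the code above:
# webElem <- remDr$findElement(using = 'xpath', value = option_xpath(10))
# webElem$clickElement()
```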
Upvotes: 4