quant

Reputation: 4482

Web scraping in R?

I would like to scrape this website.

In particular, I would like to extract the information shown in that table.

Please note that I chose a specific date in the upper-right corner.

Following this guide, I wrote the following code:

library(rvest)
url_nba <- 'https://projects.fivethirtyeight.com/2017-nba-predictions/'

webpage_nba <- read_html(url_nba)

#Using CSS selectors to scrape the rankings section
data_nba <- html_nodes(webpage_nba,'#standings-table')

#Converting the ranking data to text
data_nba <- html_text(data_nba)
write.csv(data_nba,"web scraping test.csv")

From my understanding, the numbers that I want (e.g. for the Warriors: 94%, 79%, 66%, 59%) are "coded" in a different way. In other words, what is written to web scraping test.csv is not readable.

Is there any way to transform the "coded numbers" into "regular numbers"?

Upvotes: 3

Views: 643

Answers (2)

quant

Reputation: 4482

Thanks to @Alexey's answer and this, the following code worked for me:

library(RSelenium)
library(rvest)
library(wdman)

url_nba <- 'https://projects.fivethirtyeight.com/2017-nba-predictions/'


#initiate RSelenium. If it doesn't work, try other browser engines
# rD <- rsDriver()
# remDr <- rD$client

pDrv <- phantomjs(port = 4567L)
remDr <- remoteDriver(browserName = "phantomjs", port = 4567L)
remDr$open()
#navigate to main page
remDr$navigate(url_nba)

#find the box and click option 10 (April 14 before playoffs)
webElem <- remDr$findElement(using = 'xpath', value = "//*[@id='forecast-selector']/div[2]/select/option[10]")
webElem$clickElement()

# Save html
webpage <- remDr$getPageSource()[[1]]
# Close RSelenium
remDr$close()
pDrv$stop()

# rD[["server"]]$stop() 


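# Parse every table on the page and pick the third one as a data frame
webpage_nba <- read_html(webpage) %>% html_table(fill = TRUE)
df <- webpage_nba[[3]]

# Clean the data frame: use row 3 as the header, then drop the header and footer rows
names(df) <- df[3,]
df <- tail(df,-3)
df <- head(df,-4)
# Drop columns whose header is "NA" (logical indexing is also safe when no column matches)
df <- df[ , !(is.na(names(df)) | names(df) == "NA"), drop = FALSE]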
df
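
As a small offline illustration of the difference between html_text() (used in the question) and html_table() (used above): html_text() returns all the cell text of a node as one string, while html_table() keeps the rows and columns. The HTML below is a made-up miniature table, not the real page, so treat this as a sketch only.

```r
library(rvest)

# Hypothetical miniature of the standings table; the real page has many more cells
html <- '<table id="standings-table">
  <tr><th>Team</th><th>Win Title</th></tr>
  <tr><td>Warriors</td><td>59%</td></tr>
  <tr><td>Spurs</td><td>11%</td></tr>
</table>'

node <- html_node(read_html(html), "#standings-table")

# html_text() returns a single string with all cell text run together
flat <- html_text(node)

# html_table() keeps the row/column structure
tbl <- html_table(node)
tbl
```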

Upvotes: 1

Alex Knorre

Reputation: 630

I tried to parse the data using rvest, but the challenging part is clicking the dropdown menu, which is represented by a <select> tag in the HTML. So I brought out the heavy artillery: RSelenium, which automates a real browser. With it everything became easy, thanks to the answer on SO:

library(RSelenium)
library(rvest)

url_nba <- 'https://projects.fivethirtyeight.com/2017-nba-predictions/'


#initiate RSelenium. If it doesn't work, try other browser engines
rD <- rsDriver(port=4444L,browser="firefox")
remDr <- rD$client

#navigate to main page
remDr$navigate(url_nba)

#find the box and click option 10 (April 14 before playoffs)
webElem <- remDr$findElement(using = 'xpath', value = "//*[@id='forecast-selector']/div[2]/select/option[10]")
webElem$clickElement()

# Save html
webpage <- remDr$getPageSource()[[1]]
# Close RSelenium
remDr$close()
rD[["server"]]$stop() 

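# Parse every table on the page and pick the third one as a data frame
webpage_nba <- read_html(webpage) %>% html_table(fill = TRUE)
df <- webpage_nba[[3]]

# Clean the data frame: use row 3 as the header, then drop the header and footer rows
names(df) <- df[3,]
df <- tail(df,-3)
df <- head(df,-4)
# Drop columns whose header is "NA" (logical indexing is also safe when no column matches)
df <- df[ , !(is.na(names(df)) | names(df) == "NA"), drop = FALSE]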

df

    ELO Carm-ELO 1-Week Change          Team Conf. Conf. Semis Conf. Finals Finals Win Title
4  1770     1792           -14      Warriors  West         94%          79%    66%       59%
5  1661     1660           -43         Spurs  West         90%          62%    15%       11%
6  1600     1603           +33       Raptors  East         77%          47%    25%        5%
7  1636     1640           +33      Clippers  West         58%          11%     7%        5%
8  1587     1589           -22       Celtics  East         70%          42%    24%        4%
9  1587     1584            -9       Wizards  East         79%          38%    21%        4%
10 1617     1609           +16          Jazz  West         42%           7%     5%        3%
11 1602     1606           -18       Rockets  West         70%          27%     5%        3%
12 1545     1541           -22     Cavaliers  East         59%          27%    11%        2%
13 1519     1523           +25         Bulls  East         30%          15%     7%       <1%
14 1526     1520           +37        Pacers  East         41%          17%     6%       <1%
15 1563     1564            +6 Trail Blazers  West          6%           3%     1%       <1%
16 1543     1537           -20       Thunder  West         30%           8%    <1%       <1%
17 1502     1502            -3         Bucks  East         23%           9%     3%       <1%
18 1479     1469           +46         Hawks  East         21%           6%     2%       <1%
19 1482     1480           -41     Grizzlies  West         10%           3%    <1%       <1%
20 1569     1555           +32          Heat  East           —            —      —         —
21 1552     1533           +27       Nuggets  West           —            —      —         —
22 1482     1489           -12      Pelicans  West           —            —      —         —
23 1463     1472           -18  Timberwolves  West           —            —      —         —
24 1463     1462           -40       Hornets  East           —            —      —         —
25 1441     1436           +22       Pistons  East           —            —      —         —
26 1420     1421           -20     Mavericks  West           —            —      —         —
27 1393     1395            -2         Kings  West           —            —      —         —
28 1374     1379           -13        Knicks  East           —            —      —         —
29 1367     1370           +47        Lakers  West           —            —      —         —
30 1372     1370           -14          Nets  East           —            —      —         —
31 1352     1355            -9         Magic  East           —            —      —         —
32 1338     1348           -29         76ers  East           —            —      —         —
33 1340     1337           +26          Suns  West           —            —      —         —
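
The percentage columns in this output are character strings such as "94%". If you need numbers, something like the sketch below might work. pct_to_numeric is a hypothetical helper name, and mapping "<1%" and the dash placeholder to NA is my assumption; you may prefer another convention (e.g. 0.5 for "<1%").

```r
# Hypothetical helper: convert strings like "94%" to numeric 94.
# Anything that is not a plain percentage ("<1%", the dash placeholder) becomes NA.
pct_to_numeric <- function(x) {
  x <- trimws(x)
  is_pct <- grepl("^[0-9]+(\\.[0-9]+)?%$", x)
  out <- rep(NA_real_, length(x))
  out[is_pct] <- as.numeric(sub("%", "", x[is_pct], fixed = TRUE))
  out
}

# "\u2014" is the dash used for teams that missed the playoffs
res <- pct_to_numeric(c("94%", "79%", "<1%", "\u2014"))
res
```

It could then be applied to the relevant columns, e.g. df[["Win Title"]] <- pct_to_numeric(df[["Win Title"]]).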

If you want to parse other time periods, check the option value in the HTML of the page using the Dev Tools of your browser.
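
For example, the XPath used in the code above could be built from the option index. option_xpath is a hypothetical helper of mine; the selector itself is the one from the code above, and which index corresponds to which date still has to be looked up in the page's HTML.

```r
# Hypothetical helper: build the XPath for the n-th <option> of the date selector
option_xpath <- function(index) {
  stopifnot(is.numeric(index), length(index) == 1, index >= 1)
  sprintf("//*[@id='forecast-selector']/div[2]/select/option[%d]",
          as.integer(index))
}

xp <- option_xpath(10)
xp

# Usage with an open RSelenium session (not run here):
# webElem <- remDr$findElement(using = 'xpath', value = option_xpath(10))
# webElem$clickElement()
```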

Upvotes: 4
