R webscraping, unsure how to proceed

Question

For a side project, I'm attempting to gather statistics for players in the NFL related to fantasy football. I found a URL that has the data I want: https://www.cbssports.com/fantasy/football/stats/QB/2020/ytd/stats/ppr/

I'm trying to scrape it in R and am having no luck. I've tried lots of things; the closest I've gotten is this:

library(rvest)
Test1 <- read_html("https://www.cbssports.com/fantasy/football/stats/QB/2020/season/projections/ppr/") %>% html_nodes('.TableBase-bodyTr')

There's the code I've got so far and here's the result:

Test1
{xml_nodeset (69)}
 [1]


It's just pure chaos with the relevant information embedded in there. I also tried using html_table() on it and just got an error.
Now, if I use the View function on "Test1", I can drill through many layers of data and find what I'm looking for, but what I'm trying to figure out is how to get to that data directly.
I'm not really sure where to go from here. If anyone could give me some pointers, I'd really appreciate it. My familiarity with HTML is super low, and I'm trying to read more about it. From what I was able to gather by inspecting the page, the data is stored inside the class "TableBase-bodyTr", which is why I pointed the node there.

Dave2e · Accepted Answer

There is something funky in the table formatting which is causing an error in html_table(). Not quite sure how to correct that.

Here is an alternative to scrape the contents of the rows and then create the dataframe.

library(rvest)
page <- read_html("https://www.cbssports.com/fantasy/football/stats/QB/2020/season/projections/ppr/")

# find the rows of the table
rows <- page %>% html_nodes('tr')

# the first 2 rows are header information, so skip them
# get the player name (both short and long versions)
playername <- rows[-c(1, 2)] %>% html_nodes('td span span a') %>% html_text() %>% trimws()
playername <- matrix(playername, ncol = 2, byrow = TRUE)

# get the team and position
position <- rows[-c(1, 2)] %>% html_nodes('span.CellPlayerName-position') %>% html_text() %>% trimws()
team <- rows[-c(1, 2)] %>% html_nodes('span.CellPlayerName-team') %>% html_text() %>% trimws()

# get the stats from the table (16 cells per row, read row by row)
cols <- rows[-c(1, 2)] %>% html_nodes('td') %>% html_text() %>% trimws()
stats <- matrix(cols, ncol = 16, byrow = TRUE)

# make the final answer; the first stats column is the player cell, already parsed above
answer <- data.frame(playername, position, team, stats[, -1])
# assign the column names (copied manually from the page)
statnames <- c("Name_s", "Name_l", "position", "team", 'GP', 'ATT', 'CMP', 'YDS', 'YDS/G', "TD", 'INT', 'RATE', 'ATT', 'YDS', 'AVG', 'TD', 'FL', 'FPTS', "FPPG")
names(answer) <- statnames

This will get you 95% of the way there. I didn't attempt to retrieve the column names from the web page automatically; it was easier to copy and paste them manually and assign the column names.
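
If you do want to pull the column names from the page, the same row-by-row approach may work on the header cells. Here is a minimal sketch on a small inline table rather than the real page. The HTML, the player names, and the numbers are all made up for illustration, and I'm assuming a single header row of th cells. The CBS page has two header rows, so the selector there would likely need adjusting.

```r
library(rvest)

# Hypothetical stand-in for the real page: one header row, two body rows
page <- read_html('<html><body><table>
  <tr><th>Player</th><th>GP</th><th>YDS</th></tr>
  <tr><td>Player A</td><td>16</td><td>4000</td></tr>
  <tr><td>Player B</td><td>15</td><td>3500</td></tr>
</table></body></html>')

rows <- page %>% html_nodes('tr')

# header text comes from the th cells of the first row
header <- rows[1] %>% html_nodes('th') %>% html_text() %>% trimws()

# body cells come back row by row, so fill the matrix by row
cells <- rows[-1] %>% html_nodes('td') %>% html_text() %>% trimws()
stats <- matrix(cells, ncol = length(header), byrow = TRUE)

answer <- data.frame(stats, stringsAsFactors = FALSE)
names(answer) <- header

# everything scraped is text, so convert the numeric columns explicitly
answer[-1] <- lapply(answer[-1], as.numeric)
```

Using length(header) for ncol also avoids hard-coding the column count, as long as every body row has one cell per header.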
