Reputation: 175
I have R 3.3.1 installed and am using RStudio 0.99.903. I'm trying to read a table into R from the following URL: https://www.fantasypros.com/nfl/rankings/consensus-cheatsheets.php
(I'm well aware that there is a download button, however, that is not an option for me right now)
I was able to do this easily last year with the readHTMLTable function. However, in that time, that site changed from using http to https, which causes an "XML content does not eem to be XML" error.
I tried what was suggested here: get url table into a `data.frame` R-XML-RCurl
library(XML)
library(RCurl)
url <- getURL("https://www.fantasypros.com/nfl/rankings/consensus-cheatsheets.php")
df <- readHTMLTable(URL, header = T)
The get URL function returns a large string that is essentially meaningless to me, which means readHTMLTable doesn't work correctly (I get a list, with a couple of data frames, but those are also meaningless to me. It's in Spanish with observations of things that I have no idea where they came from):
>url
[1] "\r\n<!DOCTYPE html>\n<html lang=\"en\">\n\n<head>\n <title>2016 QB Fantasy Football Rankings, QB Cheat Sheets, QB Draft / Draft Rankings</title>\n <meta http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\">\n <meta name=\"description\" content=\"Don’t trust any 1 fantasy football expert? We combine their rankings into 1 Expert Consensus Ranking. Our 2016 Draft QB rankings are updated daily.\">\n<link rel=\"canonical\" href=\"https://www.fantasypros.com/nfl/rankings/qb-cheatsheets.php\" />\n\n <meta property=\"fb:pages\" content=\"184352014941166\"/>\n
It goes on for quite a ways more.
Can someone give me suggestions on how to get this to work?
Thanks.
Upvotes: 4
Views: 5473
Reputation: 12723
Get html file from the URL
library("httr")
library("XML")
URL <- "https://www.fantasypros.com/nfl/rankings/consensus-cheatsheets.php"
temp <- tempfile(fileext = ".html")
GET(url = URL, user_agent("Mozilla/5.0"), write_disk(temp))
Parse HTML file
doc <- htmlParse(temp)
XPath query is constructed by selecting the table
element with class = "player-table"
and its child tr
element having class = 'mpb-player-'
xpexpr <- "//table[contains(@class, 'player-table')]/tbody/tr[contains(@class, 'mpb-player-')]"
Get the list of nodes from doc for the xpath expression
listofTableNodes <- getNodeSet(doc, xpexpr)
listofTableNodes
create a dataframe with the xmlvalues of the node list
df <- xmlToDataFrame(listofTableNodes, stringsAsFactors = FALSE)
# alternatively xpathSApply can be used to get the same data frame
# df <- xmlToDataFrame(xpathSApply(doc, xpexpr), stringsAsFactors = FALSE)
remove empty columns
df <- df[, seq(1, length(df), by = 2)]
add column names
xpexpr <- "//table[contains(@class, 'player-table')]/thead/tr/th"
listofTableNodes <- getNodeSet(doc, xpexpr)
listofTableNodes
colnames(df) <- gsub("[\r\n ]*$", '', xmlSApply(listofTableNodes, xmlValue))
head(df)
# Rank Player (Team) Pos Bye Best Worst Avg Std Dev ADP vs. ADP
# 1 1 Antonio Brown PIT WR1 8 1 5 1.3 0.8 1.0 0.0
# 2 2 Odell Beckham Jr. NYG WR2 8 1 9 3.1 1.6 2.0 0.0
# 3 3 Julio Jones ATL WR3 11 1 6 3.4 1.1 4.0 +1.0
# 4 4 Todd Gurley LA RB1 8 1 11 4.5 2.3 3.0 -1.0
# 5 5 David Johnson ARI RB2 9 1 19 6.1 3.5 6.0 +1.0
# 6 6 Adrian Peterson MIN RB3 6 1 22 7.6 3.8 5.0 -1.0
Upvotes: 5