Frank B.

Reputation: 1883

Trouble with Webscraping in R using XML Package

I've used the XML package successfully to scrape multiple websites, but I'm having trouble creating a data frame from this specific page:

library(XML)

url <- "http://www.foxsports.com/nfl/injuries?season=2013&seasonType=1&week=1"
df1 <- readHTMLTable(url)

print(df1)

> print(df1)
$`NULL`
NULL

$`NULL`
NULL

$`NULL`
             Player Pos         Injury           Game Status
1       Dickson, Ed  TE          thigh              Probable
2      Jensen, Ryan   C           foot              Doubtful
3     Jones, Arthur  DE        illness                   Out
4   McPhee, Pernell  LB           knee              Probable
5     Pitta, Dennis  TE dislocated hip Injured Reserve (DFR)
6  Thompson, Deonte  WR           foot              Doubtful
7 Williams, Brandon  DT            toe              Doubtful

$`NULL`
           Player Pos        Injury Game Status
1  Anderson, C.J.  RB          knee         Out
2   Ayers, Robert  DE      Achilles    Probable
3   Bailey, Champ  CB          foot         Out
4     Clady, Ryan   T      shoulder    Probable
5  Dreessen, Joel  TE          knee         Out
6    Kuper, Chris   G         ankle    Doubtful
7 Osweiler, Brock  QB left shoulder    Probable
8     Welker, Wes  WR         ankle    Probable

$`NULL`

etc

If I try to coerce the result to a single data frame, I get this error:

> df1 <- data.frame(readHTMLTable(url))
Error in data.frame(`NULL` = NULL, `NULL` = NULL, `NULL` = list(Player = 1:7,  : 
  arguments imply differing number of rows: 0, 7, 8, 6, 9, 1, 11, 4, 12, 5, 21, 3, 2, 15

I'd like all of the injury data (PLAYER, POS, INJURY, GAME STATUS) for all of the teams.

Thanks in advance.

Upvotes: 1

Views: 327

Answers (2)

Tony Breyal

Reputation: 5378

# Packages
require(XML)
require(RCurl)

# URL of interest
url <- "http://www.foxsports.com/nfl/injuries?season=2013&seasonType=1&week=1"

# Parse HTML
doc <- htmlParse(url)

# Tables which are not nulls
df1 <- readHTMLTable(doc)
df.list <- df1[!as.vector(sapply(df1, is.null))]

# Get table names
table.names <- xpathSApply(doc, "//div[@class='wisfb_injuryHeader']", function(x) gsub("^\\s+|\\s+$", "", xmlValue(x)))

# Assign names
names(df.list) <- table.names


# $`San Diego Chargers`
#             Player Pos                         Injury Game Status
# 1    Floyd, Malcom  WR                           knee    Probable
# 2   Ingram, Melvin  LB                  Torn left ACL  Day-to-Day
# 3    Liuget, Corey  DE                       shoulder    Probable
# 4  Patrick, Johnny  CB concussion, not injury related    Probable
# 5     Royal, Eddie  WR              chest, concussion    Probable
# 6  Taylor, Brandon   S                           knee    Probable
# 7      Te'o, Manti  LB                           foot         Out
# 8 Wright, Shareece  CB                          chest    Probable
# [etc.]
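
Once the list is named by team, the tables can be stacked into one data frame that keeps the team name as a column. Here is a minimal sketch of that idea, using a small made-up list in place of the scraped df.list. The team names, player names and the Team column are illustrative assumptions, and it assumes every table left in the list has the same columns.

```r
# Hypothetical named list, shaped like df.list after names() has been assigned.
# All values here are invented for illustration.
df.list <- list(
  "Team A" = data.frame(Player = c("A, One", "B, Two"), Pos = c("TE", "C"),
                        stringsAsFactors = FALSE),
  "Team B" = data.frame(Player = "C, Three", Pos = "WR",
                        stringsAsFactors = FALSE)
)

# Prepend each list name to its table as a Team column.
with.team <- Map(function(tbl, team) {
  data.frame(Team = team, tbl, check.names = FALSE, stringsAsFactors = FALSE)
}, df.list, names(df.list))

# Stack the tables row-wise and reset the row names.
all.injuries <- do.call(rbind, with.team)
rownames(all.injuries) <- NULL
```

With the real data, any one-column "No injuries reported" tables would probably need to be dropped first, since rbind expects matching columns.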

EDIT: Just saw that @Spacedman said basically the same thing in one of the comments on the answer by @Chris S.

Upvotes: 1

Chris S.

Reputation: 2225

You just need to remove the NULL elements and the one-column tables that only say "No injuries reported", then rbind the remaining tables using do.call:

n <- sapply(df1, function(x) !is.null(x) && ncol(x) == 4)
x <- do.call("rbind", df1[n])
rownames(x) <- NULL
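
To see why the filter is needed, here is a small offline sketch of the same approach. The list below is a made-up stand-in for what readHTMLTable() returns (a NULL entry, two four-column tables and a one-column "No injuries reported" table); the names tables, keep and injuries are just for illustration. data.frame() fails on such a list because it tries to bind the elements side by side as columns, which requires equal row counts, whereas rbind stacks them.

```r
# Toy stand-in for the list returned by readHTMLTable(); values are invented.
tables <- list(
  NULL,
  data.frame(Player = c("A, One", "B, Two"), Pos = c("TE", "C"),
             Injury = c("thigh", "foot"),
             `Game Status` = c("Probable", "Doubtful"),
             check.names = FALSE, stringsAsFactors = FALSE),
  data.frame(V1 = "No injuries reported", stringsAsFactors = FALSE),
  data.frame(Player = "C, Three", Pos = "WR", Injury = "ankle",
             `Game Status` = "Out",
             check.names = FALSE, stringsAsFactors = FALSE)
)

# Keep only the real injury tables: non-NULL and exactly four columns.
keep <- vapply(tables, function(x) !is.null(x) && ncol(x) == 4, logical(1))

# Stack the remaining tables row-wise and reset the row names.
injuries <- do.call(rbind, tables[keep])
rownames(injuries) <- NULL
```

vapply is used here instead of sapply only so the result is guaranteed to be a logical vector; the filtering logic is the same.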

Upvotes: 2
