RAFrancais

Reputation: 57

Subset in R not removing rows in data frame

To summarize the process: I pulled data from bbref, structured it in a data frame, converted 3 vectors from character to numeric, and took a subset of players who played more than 20 games. Yet when I reorder the data and display the top 20, only a few non-qualifiers are removed and there are still many observations with fewer than 20 games played.

library(XML)
library(RCurl)
library(plyr)

urladv <- "https://www.basketball-reference.com/leagues/NBA_2019_advanced.html"
urladvdata <-  getURL(urladv)
dataadv <- readHTMLTable(urladvdata, stringsAsFactors = FALSE, encoding = "UTF-8")
datadv <- structure(dataadv, row.names =c(NA, -734), .Names = seq_along(dataadv), class = "data.frame")
advstats <- ldply(dataadv, data.frame)
advstats[,c('PER', 'BPM')] <- sapply(advstats[,c('PER','BPM', 'G')], as.numeric)
advstats <- subset(advstats, G > 20)
advstats <- advstats[with(advstats,order(-PER)),]
advstats[1:20,]

The output of advstats[1:20,] includes players like Trevon Duval, Gary Payton, and Alan Williams, who each have 5 or fewer games played. I'm confused about what makes these observations a special case, since the subset does remove over 100 observations.

Upvotes: 0

Views: 282

Answers (1)

chrimaho

Reputation: 684

As mentioned by Ben in the comments, you're missing 'G' from the line that does sapply(). It should look like this:

advstats[,c('PER', 'BPM', 'G')] <- sapply(advstats[,c('PER','BPM', 'G')], as.numeric)

Because 'G' was missing on the left-hand side of the <-, the G column was never converted from <chr> to <dbl>. subset() still runs on a character column, but G > 20 then coerces 20 to "20" and compares the values as text, so "5" > "20" is TRUE while "19" > "20" is FALSE. That is why only some of the low-game rows were dropped.
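
Here is a small sketch of what the comparison does while G is still character. The data frame and the names df, kept_chr and kept_num are made up for illustration and are not taken from the scraped table:

```r
# Toy data standing in for the scraped table (values are invented).
df <- data.frame(
  Player = c("A", "B", "C", "D"),
  G      = c("5", "82", "19", "3"),
  stringsAsFactors = FALSE
)

# G is character here, so 20 is coerced to "20" and compared as text.
# "5" and "3" sort after "20", so those rows survive the filter.
kept_chr <- subset(df, G > 20)

# Convert G first; the same filter is now a numeric comparison.
df$G <- as.numeric(df$G)
kept_num <- subset(df, G > 20)

print(kept_chr$Player)
print(kept_num$Player)
```

Running str(advstats) before the subset() call is a quick way to confirm that G really is numeric.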

I trust that helps?

Upvotes: 1
