Reputation: 3505
I have a dataframe which has merged player and team data for soccer seasons So for a particular player in a specific season I have data like
df <- data.frame(team=c(NA,"CRP",NA,"CRP","CRP",NA),
player=c(NA,"Ed",NA,"Ed","Ed",NA),
playerGame= c(NA,1,NA,2,3,NA),
teamGame =c(1,2,3,4,5,6))
Where the NA's indicate that the player did not appear in that specific team game
How would I most efficiently replace the team and player NA's with "CRP" and "Ed" respectively and have a plGame output of, in this instance, 0,1,1,2,3,3
EDIT
Sorry, I wrote this when I woke up in the middle of the night and may have over-simplified my problem too much. Only one person seems to have picked up on the fact that this is a subset of a much larger set of data and even he/she did not follow it though that a straight hardcode replacement of player and team was insufficient Thanks for the replies. Dsee's hint for the na.locf in the zoo package and the first line of AK's answer appears to offer the best way forward
df$playerGame[df$teamGame == min(df$teamGame) & is.na(df$playerGame) == TRUE] <- 0
na.locf(df$playerGame)
This covers the eventuality of more than one NA to start the sequence. In my case the min(df$teamGame) will always be 1 so hardcoding that may speed things up
A more realistic example is here
library(zoo)
library(plyr)
newdf <- data.frame(team=c("CRP","CRP","CRP","CRP","CRP","CRP","TOT","TOT","TOT"),
player=c(NA,"Ed",NA,"Bill","Bill",NA,NA,NA,"Tom"),
playerGame= c(NA,1,NA,1,2,NA,NA,NA,1),
teamGame =c(1,2,3,1,2,3,1,2,3))
I can now show the team for every row Each team plays three games in a season. Ed and Bill, play for CRP and appear in games 2 and 1,2 respectively. Tom plays for TOT in game 3 only. Assume that player names are unique(even in real world data)
It seems to me that I need to create another column, 'playerTeam'
newdf$playerTeam <- 0
for (i in 1:nrow(newdf)) {
newdf$playerTeam[i] <-ceiling(i/3)
}
I can then use this value to fill in the player gaps. I have used the sort functiom which omits NA
newdf <- ddply(newdf,.(playerTeam),transform,player=sort(player)[1])
I can then use the aforementioned code
newdf$playerGame[newdf$teamGame == 1 & is.na(newdf$playerGame) == TRUE] <- 0
newdf$playerGame <- na.locf(newdf$playerGame)
team player playerGame teamGame playerTeam
1 CRP Ed 0 1 1
2 CRP Ed 1 2 1
3 CRP Ed 1 3 1
4 CRP Bill 1 1 2
5 CRP Bill 2 2 2
6 CRP Bill 2 3 2
7 TOT Tom 0 1 3
8 TOT Tom 0 2 3
9 TOT Tom 1 3 3
I will need to build in season as well but that should not be a problem
Am I missing anything here?
I have several hundred thousand rows to process so any speed ups would be helpful. For instance I would probably want to avoid ddply and use a data.table approach or another apply function, right
Upvotes: 0
Views: 4123
Reputation: 2651
There seem to be 2 parts to what you want:
For (1), you could do:
df$team[is.na(df$team)] <- 'CRP'
Similarly you can alter the other component of the dataframe
For (2) you could do this:
if(is.na(df$playerGame[1])) {
df$playerGame[1] <- 0
}
for(i in 2:length(df$playerGame)) {
if(is.na(x[i])) {
df$playerGame[i] <- df$playerGame[i-1]
}
}
then df$playerGame
is:
[1] 0 1 1 2 3 3
Perhaps there is a very nifty way to do this, but this is clearly readable...
Upvotes: 2
Reputation: 71
replace NA's for team and player:
df$team[is.na(df$team)] <- "CRP"
df$player[is.na(df$player)] <- "Ed"
There is probably a more efficient way to get the adjacent values, but is works.
If the first or/and the last value is NA, as in your example, I had to use two additional lines:
df$playerGame[df$teamGame == min(df$teamGame) & is.na(df$playerGame) == TRUE] <- 0
df$playerGame[df$teamGame == max(df$teamGame) & is.na(df$playerGame) == TRUE] <- max(df$playerGame, na.rm = TRUE)
For all other observations, this get the adjacent values:
df$playerGame[is.na(df$playerGame) == TRUE] <- df$playerGame[-1]
df
team player playerGame teamGame
CRP Ed 0 1
CRP Ed 1 2
CRP Ed 1 3
CRP Ed 2 4
CRP Ed 3 5
CRP Ed 3 6
For more than one team and/or player I would suggest to combine it with ddply (plyr).
Upvotes: 1
Reputation: 55350
to select the NA's, for say player
, use
df$player[is.na(df$player)]
Then to assign a value to these use
df$player[is.na(df$player)] <- "Ed"
If you just want to assign the entire player column the same name, you do not need to select any values:
df$player[] <- "Ed" # you can omit the brackets [], which are shown just for emphasis
you can then do the same for df$team
stringsAsFactors=FALSE
data.frame( . , stringsAsFactors=FALSE)
Upvotes: 1