pssguy
pssguy

Reputation: 3505

replace NA in an R vector with adjacent values

I have a dataframe which has merged player and team data for soccer seasons So for a particular player in a specific season I have data like

df <- data.frame(team=c(NA,"CRP",NA,"CRP","CRP",NA),
             player=c(NA,"Ed",NA,"Ed","Ed",NA),
             playerGame= c(NA,1,NA,2,3,NA),
             teamGame =c(1,2,3,4,5,6)) 

Where the NA's indicate that the player did not appear in that specific team game

How would I most efficiently replace the team and player NA's with "CRP" and "Ed" respectively and have a plGame output of, in this instance, 0,1,1,2,3,3


EDIT

Sorry, I wrote this when I woke up in the middle of the night and may have over-simplified my problem too much. Only one person seems to have picked up on the fact that this is a subset of a much larger set of data and even he/she did not follow it though that a straight hardcode replacement of player and team was insufficient Thanks for the replies. Dsee's hint for the na.locf in the zoo package and the first line of AK's answer appears to offer the best way forward

df$playerGame[df$teamGame == min(df$teamGame) & is.na(df$playerGame) == TRUE] <- 0
na.locf(df$playerGame)

This covers the eventuality of more than one NA to start the sequence. In my case the min(df$teamGame) will always be 1 so hardcoding that may speed things up

A more realistic example is here

library(zoo)
library(plyr)

newdf <- data.frame(team=c("CRP","CRP","CRP","CRP","CRP","CRP","TOT","TOT","TOT"),
             player=c(NA,"Ed",NA,"Bill","Bill",NA,NA,NA,"Tom"),
             playerGame= c(NA,1,NA,1,2,NA,NA,NA,1),
             teamGame =c(1,2,3,1,2,3,1,2,3))

I can now show the team for every row Each team plays three games in a season. Ed and Bill, play for CRP and appear in games 2 and 1,2 respectively. Tom plays for TOT in game 3 only. Assume that player names are unique(even in real world data)

It seems to me that I need to create another column, 'playerTeam'

newdf$playerTeam <- 0

for (i in 1:nrow(newdf)) {
newdf$playerTeam[i] <-ceiling(i/3)
}

I can then use this value to fill in the player gaps. I have used the sort functiom which omits NA

newdf <- ddply(newdf,.(playerTeam),transform,player=sort(player)[1])

I can then use the aforementioned code

newdf$playerGame[newdf$teamGame == 1 & is.na(newdf$playerGame) == TRUE] <- 0
newdf$playerGame <- na.locf(newdf$playerGame)

   team player playerGame teamGame playerTeam
1  CRP     Ed          0        1          1
2  CRP     Ed          1        2          1
3  CRP     Ed          1        3          1
4  CRP   Bill          1        1          2
5  CRP   Bill          2        2          2
6  CRP   Bill          2        3          2
7  TOT    Tom          0        1          3
8  TOT    Tom          0        2          3
9  TOT    Tom          1        3          3

I will need to build in season as well but that should not be a problem

Am I missing anything here?

I have several hundred thousand rows to process so any speed ups would be helpful. For instance I would probably want to avoid ddply and use a data.table approach or another apply function, right

Upvotes: 0

Views: 4123

Answers (3)

A_K
A_K

Reputation: 2651

There seem to be 2 parts to what you want:

  1. You want to replace the player name & team with pre-determined values
  2. You wish to carry forward the counts of the games through the list of playerGame

For (1), you could do:

df$team[is.na(df$team)] <- 'CRP' 

Similarly you can alter the other component of the dataframe

For (2) you could do this:

if(is.na(df$playerGame[1])) {
    df$playerGame[1] <- 0
}
for(i in 2:length(df$playerGame)) { 
    if(is.na(x[i])) {
        df$playerGame[i] <- df$playerGame[i-1]
    }
} 

then df$playerGame is:

[1] 0 1 1 2 3 3

Perhaps there is a very nifty way to do this, but this is clearly readable...

Upvotes: 2

tfr
tfr

Reputation: 71

replace NA's for team and player:

df$team[is.na(df$team)] <- "CRP"
df$player[is.na(df$player)] <- "Ed"

There is probably a more efficient way to get the adjacent values, but is works.

If the first or/and the last value is NA, as in your example, I had to use two additional lines:

df$playerGame[df$teamGame == min(df$teamGame) & is.na(df$playerGame) == TRUE] <- 0
df$playerGame[df$teamGame == max(df$teamGame) & is.na(df$playerGame) == TRUE] <- max(df$playerGame, na.rm = TRUE)

For all other observations, this get the adjacent values:

df$playerGame[is.na(df$playerGame) == TRUE] <- df$playerGame[-1]

df

team player playerGame teamGame
CRP     Ed          0        1
CRP     Ed          1        2
CRP     Ed          1        3
CRP     Ed          2        4
CRP     Ed          3        5
CRP     Ed          3        6

For more than one team and/or player I would suggest to combine it with ddply (plyr).

Upvotes: 1

Ricardo Saporta
Ricardo Saporta

Reputation: 55350

to select the NA's, for say player, use

  df$player[is.na(df$player)]

Then to assign a value to these use

  df$player[is.na(df$player)]  <- "Ed"

If you just want to assign the entire player column the same name, you do not need to select any values:

  df$player[]  <-  "Ed"   # you can omit the brackets [], which are shown just for emphasis

you can then do the same for df$team


On a side note, when you create your data frame, if you plan to add values other than those already there, you will likely want to add stringsAsFactors=FALSE

 data.frame( . , stringsAsFactors=FALSE)

Upvotes: 1

Related Questions