Reputation: 808

Replace subsequent values with NA if prior values max possible value on separate column

I have dataset where each row represents a new test taken by individuals. It has four variables.

1) IDs of the test takers:

id <- c(1, 1, 1, 2, 2)

2) Dates the individuals took the test:

dates <- as.Date(c("2007-06-22", "2008-06-21", "2009-06-22", "2008-06-21", "2009-06-22"))

3) Scores they received on that test:

scores <- c(0, 12, 12, 1, 3)

4) Whether or not that score was the best score of the individual up to that time point.

improvement <- c("No", "Yes", "No", "No", "Yes")

So the dataset is:

df <- data.frame(id, dates, scores, improvement) 

  id      dates scores improvement
1  1 2007-06-22      0          No
2  1 2008-06-21     12         Yes
3  1 2009-06-22     12          No
4  2 2008-06-21      1          No
5  2 2009-06-22      3         Yes

I've got a problem though. A score of 12 is the highest. So if someone gets a 12, there would be no more room for improvement. Do you know how I could make it so that when someone gets a 12, on any subsequent rows they get NA on improvement?

i.e.,

   id      dates scores improvement
 1  1 2007-06-22      0          No
 2  1 2008-06-21     12         Yes
 3  1 2009-06-22     12          NA
 4  2 2008-06-21      1          No
 5  2 2009-06-22      3         Yes

Upvotes: 3

Answers (3)

akrun

Reputation: 887881

Here is an option using data.table. We convert the 'data.frame' to 'data.table' (setDT(df)), grouped by 'id', order the 'id' and 'dates', get the logical index for maximum scores (scores == max(scores)), find the cumulative sum (cumsum(...)), convert it to a logical vector (>1) and get the row index (.I). Specify the row index in i and assign (:=) the elements in 'improvement' that corresponds to that index to NA

library(data.table)
setDT(df)[df[order(id, dates), .I[cumsum(scores == max(scores))>1], 
                             by = id]$V1, improvement := NA]
df
#    id      dates scores improvement
#1:  1 2007-06-22      0          No
#2:  1 2008-06-21     12         Yes
#3:  1 2009-06-22     12          NA
#4:  2 2008-06-21      1          No
#5:  2 2009-06-22      3         Yes

Update

If the max values are not adjacent, either we can order by 'scores' as well or another option is

setDT(df1)[df1[order(id, dates), .I[cumsum(scores == max(scores))>1 & 
               scores ==max(scores)], by = id]$V1, improvement := NA]
df1
#     id      dates scores improvement
#1:  1 2007-06-22      0          No
#2:  1 2008-06-21     12         Yes
#3:  1 2009-06-22      5          No
#4:  1 2010-06-21     12          NA
#5:  2 2008-06-21      1          No
#6:  2 2009-06-22      3         Yes

A slight improvement to the above is to call the scores==max(scores) one time by creating an object

setDT(df1)[df1[order(id, dates), {mx <- scores == max(scores)
                              .I[cumsum(mx)>1 & mx]},  
                      by = id]$V1, improvement := NA]

data

df1 <- structure(list(id = c(1, 1, 1, 1, 2, 2), dates = structure(c(13686, 
14051, 14417, 14781, 14051, 14417), class = "Date"), scores = c(0, 
12, 5, 12, 1, 3), improvement = structure(c(1L, 2L, 1L, 1L, 1L, 
2L), .Label = c("No", "Yes"), class = "factor")), .Names = c("id", 
"dates", "scores", "improvement"), row.names = c(NA, -6L),
 class = "data.frame")

Upvotes: 1

eipi10

Reputation: 93871

How about this: We use dplyr to group by id, then for each id we check whether any score is equal to 12. If so, then we replace every value of improvement with NA in subsequent rows after the first instance of a score of 12.

library(dplyr)

df %>% group_by(id) %>% arrange(id, dates) %>%
  mutate(improvement = replace(improvement, if(any(scores==12)) (min(which(scores==12))+1):n(), NA))

     id      dates scores improvement
  <dbl>     <date>  <dbl>      <fctr>
1     1 2007-06-22      0          No
2     1 2008-06-21     12         Yes
3     1 2009-06-22     12          NA
4     2 2008-06-21      1          No
5     2 2009-06-22      3         Yes

Upvotes: 3

David C.

Reputation: 1994

the operation can also be done just using the base package. Though it is bit messy, it creates opportunity for people who aren't familiar with the dplyr package, etc. Here is the code and I have explained my logic as comments in the code:

## Note: you cannot have factor levels in the `improvement` column
df$id <- as.character(df$id) ##IMPORTANT
df$improvement <- as.character(df$improvement) ##really important


new_df <- NULL #new data frame; placeholder for now
for(test_taker in unique(df$id)) {

  ## Sub-dataframe for each individual's record:
  sub_df <- df[df$id == test_taker, ]

  ## For each individual's record, look for score of 12
  ## If there is such a score that occur more than once
  ## change the second score of 12 and beyond to NA 
  indices <- which(df$scores[df$id == test_taker] %in% c(12))
  if(sum(indices) > 1) {
    sub_df[indices[2:length(indices)], "improvement"] <-NA
  }

  ## Update the new data.frame
  new_df <- rbind(new_df,  sub_df)
}
new_df

   ##    id      dates scores improvement
   ##  1  1 2007-06-22      0          No
   ##  2  1 2008-06-21     12         Yes
   ##  3  1 2009-06-22     12        <NA>
   ##  4  2 2008-06-21      1          No
   ##  5  2 2009-06-22      3         Yes

Upvotes: 1

Replace subsequent values with NA if prior values max possible value on separate column

Answers (3)

Update

data

Related Questions