Reputation: 808
I have dataset where each row represents a new test taken by individuals. It has four variables.
1) IDs of the test takers:
id <- c(1, 1, 1, 2, 2)
2) Dates the individuals took the test:
dates <- as.Date(c("2007-06-22", "2008-06-21", "2009-06-22", "2008-06-21", "2009-06-22"))
3) Scores they received on that test:
scores <- c(0, 12, 12, 1, 3)
4) Whether or not that score was the best score of the individual up to that time point.
improvement <- c("No", "Yes", "No", "No", "Yes")
So the dataset is:
df <- data.frame(id, dates, scores, improvement)
id dates scores improvement
1 1 2007-06-22 0 No
2 1 2008-06-21 12 Yes
3 1 2009-06-22 12 No
4 2 2008-06-21 1 No
5 2 2009-06-22 3 Yes
I've got a problem though. A score of 12 is the highest. So if someone gets a 12, there would be no more room for improvement. Do you know how I could make it so that when someone gets a 12, on any subsequent rows they get NA on improvement?
i.e.,
id dates scores improvement
1 1 2007-06-22 0 No
2 1 2008-06-21 12 Yes
3 1 2009-06-22 12 NA
4 2 2008-06-21 1 No
5 2 2009-06-22 3 Yes
Upvotes: 3
Views: 112
Reputation: 887881
Here is an option using data.table
. We convert the 'data.frame' to 'data.table' (setDT(df)
), grouped by 'id', order
the 'id' and 'dates', get the logical index for maximum scores (scores == max(scores)
), find the cumulative sum (cumsum(...)
), convert it to a logical vector
(>1
) and get the row index (.I
). Specify the row index in i
and assign (:=
) the elements in 'improvement' that corresponds to that index to NA
library(data.table)
setDT(df)[df[order(id, dates), .I[cumsum(scores == max(scores))>1],
by = id]$V1, improvement := NA]
df
# id dates scores improvement
#1: 1 2007-06-22 0 No
#2: 1 2008-06-21 12 Yes
#3: 1 2009-06-22 12 NA
#4: 2 2008-06-21 1 No
#5: 2 2009-06-22 3 Yes
If the max values are not adjacent, either we can order
by 'scores' as well or another option is
setDT(df1)[df1[order(id, dates), .I[cumsum(scores == max(scores))>1 &
scores ==max(scores)], by = id]$V1, improvement := NA]
df1
# id dates scores improvement
#1: 1 2007-06-22 0 No
#2: 1 2008-06-21 12 Yes
#3: 1 2009-06-22 5 No
#4: 1 2010-06-21 12 NA
#5: 2 2008-06-21 1 No
#6: 2 2009-06-22 3 Yes
A slight improvement to the above is to call the scores==max(scores)
one time by creating an object
setDT(df1)[df1[order(id, dates), {mx <- scores == max(scores)
.I[cumsum(mx)>1 & mx]},
by = id]$V1, improvement := NA]
df1 <- structure(list(id = c(1, 1, 1, 1, 2, 2), dates = structure(c(13686,
14051, 14417, 14781, 14051, 14417), class = "Date"), scores = c(0,
12, 5, 12, 1, 3), improvement = structure(c(1L, 2L, 1L, 1L, 1L,
2L), .Label = c("No", "Yes"), class = "factor")), .Names = c("id",
"dates", "scores", "improvement"), row.names = c(NA, -6L),
class = "data.frame")
Upvotes: 1
Reputation: 93871
How about this: We use dplyr
to group by id
, then for each id
we check whether any score is equal to 12. If so, then we replace every value of improvement
with NA
in subsequent rows after the first instance of a score of 12.
library(dplyr)
df %>% group_by(id) %>% arrange(id, dates) %>%
mutate(improvement = replace(improvement, if(any(scores==12)) (min(which(scores==12))+1):n(), NA))
id dates scores improvement <dbl> <date> <dbl> <fctr> 1 1 2007-06-22 0 No 2 1 2008-06-21 12 Yes 3 1 2009-06-22 12 NA 4 2 2008-06-21 1 No 5 2 2009-06-22 3 Yes
Upvotes: 3
Reputation: 1994
the operation can also be done just using the base package. Though it is bit messy, it creates opportunity for people who aren't familiar with the dplyr
package, etc. Here is the code and I have explained my logic as comments in the code:
## Note: you cannot have factor levels in the `improvement` column
df$id <- as.character(df$id) ##IMPORTANT
df$improvement <- as.character(df$improvement) ##really important
new_df <- NULL #new data frame; placeholder for now
for(test_taker in unique(df$id)) {
## Sub-dataframe for each individual's record:
sub_df <- df[df$id == test_taker, ]
## For each individual's record, look for score of 12
## If there is such a score that occur more than once
## change the second score of 12 and beyond to NA
indices <- which(df$scores[df$id == test_taker] %in% c(12))
if(sum(indices) > 1) {
sub_df[indices[2:length(indices)], "improvement"] <-NA
}
## Update the new data.frame
new_df <- rbind(new_df, sub_df)
}
new_df
## id dates scores improvement
## 1 1 2007-06-22 0 No
## 2 1 2008-06-21 12 Yes
## 3 1 2009-06-22 12 <NA>
## 4 2 2008-06-21 1 No
## 5 2 2009-06-22 3 Yes
Upvotes: 1