Reputation: 3384
library(data.table)
library(dplyr)
dataHAVE=data.frame("student"=c(1,1,1,2,2,2,3,3,3),
"score"=c(0,8,8,7,9,4,9,2,7),
"time"=c(1,2,3,1,2,3,1,2,3))
dataWANT=data.frame("student"=c(1,1,1,2,2,2,3,3,3),
"score"=c(0,8,8,7,9,4,9,2,7),
"time"=c(1,2,3,1,2,3,1,2,3),
"score3"=c(1,1,1,0,0,0,1,1,1),
"timescore3"=c(1,1,1,3,3,3,2,2,2),
"score7"=c(1,1,1,1,1,1,1,1,1),
"timescore7"=c(1,1,1,1,1,1,2,2,2))
dataHAVE[, score3 := ifelse(score<=3,
time[which.min(score<=3)],
time[which.max(time)]), by=student]
I have 'dataHAVE' and want to generate 'dataWANT'
1) score3 equals to 1 if a student has any score less than or equals to 3; otherwise 0
2) score7 equals to 1 if a student has any score less than or equals to 7; otherwise 0
3) timescore3 equals to the minimum time value at which a student scores a 3 or less; if a student does not score a 3 or less as indicated by score equals to 3, then timecsore3 is the maximum time for that student.
4) timescore7 equals to the minimum time value at which a student scores a 7 or less; if a student does not score a 7 or less as indicated by score equals to 7, then timecsore7 is the maximum time for that student.
My attempt at this shows above but does not work and please note, I attempted Base R and dplyr
but the dataset is so big that these take a very long time. A data.table solution is ideal.
NEW DATA TO HANDLE MISSING::
dataHAVE=data.frame("student"=c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5),
"score"=c(0,8,8,7,9,4,9,2,7,NA,4,7,NA,NA,NA),
"time"=c(1,2,3,1,2,3,1,2,3,1,2,3,1,2,3))
Updated DATA with missing 'time'
dataHAVE=data.frame("student"=c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5,6,6,6,7,7,7),
"score"=c(0,8,8,7,9,4,9,2,7,NA,4,7,NA,NA,NA,6,9,3,NA,NA,NA),
"time"=c(1,2,3,1,2,3,1,2,3,1,2,3,1,2,3,NA,2,NA,NA,NA,NA))
Upvotes: 1
Views: 65
Reputation: 887108
Based on the logic showed, it can be done with data.table
library(data.table)
setDT(dataHAVE)[,c("score3", "timescore3", "score7", "timescore7") := {
sc3 <- score <=3
sc7 <- score <= 7
tsc3 <- if(any(sc3)) min(time[sc3]) else max(time)
tsc7 <- if(any(sc7)) min(time[sc7]) else max(time)
.(+(any(sc3)), tsc3, +(any(sc7)),tsc7 )}, .(student)]
If there are missing values, then use
setDT(dataHAVE)[,c("score3", "timescore3", "score7", "timescore7") := {
sc3 <- score <=3 & !is.na(score)
sc7 <- score <= 7 & !is.na(score)
tsc3 <- if(any(sc3)) min(time[sc3], na.rm = TRUE) else max(time, na.rm = TRUE)
tsc7 <- if(any(sc7)) min(time[sc7], na.rm = TRUE) else max(time, na.rm = TRUE)
.(+(any(sc3)), tsc3, +(any(sc7)),tsc7 )}, .(student)]
Upvotes: 1