bvowe
bvowe

Reputation: 3384

Data Table Solution To New Variables By Group

library(data.table)
library(dplyr)

dataHAVE=data.frame("student"=c(1,1,1,2,2,2,3,3,3),
                    "score"=c(0,8,8,7,9,4,9,2,7),
                    "time"=c(1,2,3,1,2,3,1,2,3))


dataWANT=data.frame("student"=c(1,1,1,2,2,2,3,3,3),
                    "score"=c(0,8,8,7,9,4,9,2,7),
                    "time"=c(1,2,3,1,2,3,1,2,3),
                    "score3"=c(1,1,1,0,0,0,1,1,1),
                    "timescore3"=c(1,1,1,3,3,3,2,2,2),
                    "score7"=c(1,1,1,1,1,1,1,1,1),
                    "timescore7"=c(1,1,1,1,1,1,2,2,2))




dataHAVE[, score3 := ifelse(score<=3,
                               time[which.min(score<=3)],
                               time[which.max(time)]), by=student]

I have 'dataHAVE' and want to generate 'dataWANT'

1) score3 equals to 1 if a student has any score less than or equals to 3; otherwise 0

2) score7 equals to 1 if a student has any score less than or equals to 7; otherwise 0

3) timescore3 equals to the minimum time value at which a student scores a 3 or less; if a student does not score a 3 or less as indicated by score equals to 3, then timecsore3 is the maximum time for that student.

4) timescore7 equals to the minimum time value at which a student scores a 7 or less; if a student does not score a 7 or less as indicated by score equals to 7, then timecsore7 is the maximum time for that student.

My attempt at this shows above but does not work and please note, I attempted Base R and dplyr but the dataset is so big that these take a very long time. A data.table solution is ideal.

NEW DATA TO HANDLE MISSING::

dataHAVE=data.frame("student"=c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5),
                    "score"=c(0,8,8,7,9,4,9,2,7,NA,4,7,NA,NA,NA),
                    "time"=c(1,2,3,1,2,3,1,2,3,1,2,3,1,2,3))

Updated DATA with missing 'time'

dataHAVE=data.frame("student"=c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5,6,6,6,7,7,7),
                    "score"=c(0,8,8,7,9,4,9,2,7,NA,4,7,NA,NA,NA,6,9,3,NA,NA,NA),
                    "time"=c(1,2,3,1,2,3,1,2,3,1,2,3,1,2,3,NA,2,NA,NA,NA,NA))

Upvotes: 1

Views: 65

Answers (1)

akrun
akrun

Reputation: 887108

Based on the logic showed, it can be done with data.table

library(data.table)
setDT(dataHAVE)[,c("score3", "timescore3", "score7", "timescore7") := {
                   sc3 <- score <=3
                   sc7 <-  score <= 7
                   tsc3 <- if(any(sc3)) min(time[sc3]) else max(time)
                   tsc7 <- if(any(sc7)) min(time[sc7]) else max(time)
         .(+(any(sc3)), tsc3, +(any(sc7)),tsc7 )}, .(student)]

If there are missing values, then use

setDT(dataHAVE)[,c("score3", "timescore3", "score7", "timescore7") := {
                   sc3 <- score <=3 & !is.na(score)
                    sc7 <-  score <= 7 & !is.na(score)
                    tsc3 <- if(any(sc3)) min(time[sc3], na.rm = TRUE) else max(time, na.rm = TRUE)
                    tsc7 <- if(any(sc7)) min(time[sc7], na.rm = TRUE) else max(time, na.rm = TRUE)
          .(+(any(sc3)), tsc3, +(any(sc7)),tsc7 )}, .(student)]

Upvotes: 1

Related Questions