Reputation: 249
For this question, i have given just one CaseNo as an example. In my df, i have millions.
Essentially what i want if the submission time is same for case no, the first row should equal one and all the other rows should equal 0 for that submission time, then for the submission time that is different, it should be 2 and so on.
df
CaseNo<-c(1834096,1834096,1834096,1834096,1834096,1834096,1834096,1834096,1834096,1834096,1834096)
SubmissionTime<-c("2019-12-06 2:06:59 PM","2019-12-06 2:06:59 PM","2020-05-08 2:18:54 PM","2020-05-08 2:18:54 PM","2020-05-08 2:18:54 PM","2020-05-08 2:18:54 PM","2020-05-08 2:18:54 PM","2020-05-12 7:25:03 PM","2020-05-12 7:25:03 PM","2020-05-12 7:25:03 PM","2020-05-12 7:25:03 PM")
What i want:
CaseNo<-c(1834096,1834096,1834096,1834096,1834096,1834096,1834096,1834096,1834096,1834096,1834096)
SubmissionTime<-c("2019-12-06 2:06:59 PM","2019-12-06 2:06:59 PM","2020-05-08 2:18:54 PM","2020-05-08 2:18:54 PM","2020-05-08 2:18:54 PM","2020-05-08 2:18:54 PM","2020-05-08 2:18:54 PM","2020-05-12 7:25:03 PM","2020-05-12 7:25:03 PM","2020-05-12 7:25:03 PM","2020-05-12 7:25:03 PM")
Rank<-c(1,0,2,0,0,0,0,3,0,0,0)
For every new case, it should start from 1.
Also, i have already arranged the df such that it goes from the oldest submissiontime to the most recent
I was thinking a combination of dense_rank and/or rank using dplyr?
Upvotes: 1
Views: 107
Reputation: 10761
I wrote a function to perform this operation. This function does assume the data is already ordered, as you mentioned this in your original question. The na.omit
bit is because you get an NA
when operating on the first value in the vector.
Since the operations are wrapped in a function, it may be helpful to examine each line step-by-step outside the function.
make_rank <- function(x){
# compare current value to previous value, take cumulative sum
x_1 <- cumsum(c(1, na.omit(x != lag(x))))
# compare again, replacing with 0 where values match
c(1, na.omit(ifelse(x_1 != lag(x_1), x_1, 0)))
}
make_rank(SubmissionTime)
# [1] 1 0 2 0 0 0 0 3 0 0 0
Upvotes: 1
Reputation: 101838
Assuming your data has been sorted by both CaseNo
and SubmissintTime
, here is a base R option using ave
+ cumsum
within(
df,
Rank <- ave(cumsum(!duplicated(df)),
CaseNo,
SubmissionTime,
FUN = function(x) replace(x, seq_along(x) > 1, 0)
)
)
which gives
CaseNo SubmissionTime Rank
1 1834096 2019-12-06 2:06:59 PM 1
2 1834096 2019-12-06 2:06:59 PM 0
3 1834096 2020-05-08 2:18:54 PM 2
4 1834096 2020-05-08 2:18:54 PM 0
5 1834096 2020-05-08 2:18:54 PM 0
6 1834096 2020-05-08 2:18:54 PM 0
7 1834096 2020-05-08 2:18:54 PM 0
8 1834096 2020-05-12 7:25:03 PM 3
9 1834096 2020-05-12 7:25:03 PM 0
10 1834096 2020-05-12 7:25:03 PM 0
11 1834096 2020-05-12 7:25:03 PM 0
data
> dput(df)
structure(list(CaseNo = c(1834096, 1834096, 1834096, 1834096,
1834096, 1834096, 1834096, 1834096, 1834096, 1834096, 1834096
), SubmissionTime = c("2019-12-06 2:06:59 PM", "2019-12-06 2:06:59 PM",
"2020-05-08 2:18:54 PM", "2020-05-08 2:18:54 PM", "2020-05-08 2:18:54 PM",
"2020-05-08 2:18:54 PM", "2020-05-08 2:18:54 PM", "2020-05-12 7:25:03 PM",
"2020-05-12 7:25:03 PM", "2020-05-12 7:25:03 PM", "2020-05-12 7:25:03 PM"
)), class = "data.frame", row.names = c(NA, -11L))
Upvotes: 1