Piccinin1992
Piccinin1992

Reputation: 249

Possible to replicate this in R?

For this question, i have given just one CaseNo as an example. In my df, i have millions.

Essentially what i want if the submission time is same for case no, the first row should equal one and all the other rows should equal 0 for that submission time, then for the submission time that is different, it should be 2 and so on.

df

CaseNo<-c(1834096,1834096,1834096,1834096,1834096,1834096,1834096,1834096,1834096,1834096,1834096)
SubmissionTime<-c("2019-12-06  2:06:59 PM","2019-12-06  2:06:59 PM","2020-05-08  2:18:54 PM","2020-05-08  2:18:54 PM","2020-05-08  2:18:54 PM","2020-05-08  2:18:54 PM","2020-05-08  2:18:54 PM","2020-05-12  7:25:03 PM","2020-05-12  7:25:03 PM","2020-05-12  7:25:03 PM","2020-05-12  7:25:03 PM")

What i want:

CaseNo<-c(1834096,1834096,1834096,1834096,1834096,1834096,1834096,1834096,1834096,1834096,1834096)
SubmissionTime<-c("2019-12-06  2:06:59 PM","2019-12-06  2:06:59 PM","2020-05-08  2:18:54 PM","2020-05-08  2:18:54 PM","2020-05-08  2:18:54 PM","2020-05-08  2:18:54 PM","2020-05-08  2:18:54 PM","2020-05-12  7:25:03 PM","2020-05-12  7:25:03 PM","2020-05-12  7:25:03 PM","2020-05-12  7:25:03 PM")
Rank<-c(1,0,2,0,0,0,0,3,0,0,0)

For every new case, it should start from 1.

Also, i have already arranged the df such that it goes from the oldest submissiontime to the most recent

I was thinking a combination of dense_rank and/or rank using dplyr?

Upvotes: 1

Views: 107

Answers (2)

bouncyball
bouncyball

Reputation: 10761

I wrote a function to perform this operation. This function does assume the data is already ordered, as you mentioned this in your original question. The na.omit bit is because you get an NA when operating on the first value in the vector.

Since the operations are wrapped in a function, it may be helpful to examine each line step-by-step outside the function.

make_rank <- function(x){
  # compare current value to previous value, take cumulative sum
  x_1 <- cumsum(c(1, na.omit(x != lag(x))))
  # compare again, replacing with 0 where values match
  c(1, na.omit(ifelse(x_1 != lag(x_1), x_1, 0)))
}

make_rank(SubmissionTime)
# [1] 1 0 2 0 0 0 0 3 0 0 0

Upvotes: 1

ThomasIsCoding
ThomasIsCoding

Reputation: 101838

Assuming your data has been sorted by both CaseNo and SubmissintTime, here is a base R option using ave+ cumsum

within(
  df,
  Rank <- ave(cumsum(!duplicated(df)),
    CaseNo,
    SubmissionTime,
    FUN = function(x) replace(x, seq_along(x) > 1, 0)
  )
)

which gives

    CaseNo         SubmissionTime Rank
1  1834096 2019-12-06  2:06:59 PM    1
2  1834096 2019-12-06  2:06:59 PM    0
3  1834096 2020-05-08  2:18:54 PM    2
4  1834096 2020-05-08  2:18:54 PM    0
5  1834096 2020-05-08  2:18:54 PM    0
6  1834096 2020-05-08  2:18:54 PM    0
7  1834096 2020-05-08  2:18:54 PM    0
8  1834096 2020-05-12  7:25:03 PM    3
9  1834096 2020-05-12  7:25:03 PM    0
10 1834096 2020-05-12  7:25:03 PM    0
11 1834096 2020-05-12  7:25:03 PM    0

data

> dput(df)
structure(list(CaseNo = c(1834096, 1834096, 1834096, 1834096, 
1834096, 1834096, 1834096, 1834096, 1834096, 1834096, 1834096
), SubmissionTime = c("2019-12-06  2:06:59 PM", "2019-12-06  2:06:59 PM",
"2020-05-08  2:18:54 PM", "2020-05-08  2:18:54 PM", "2020-05-08  2:18:54 PM",
"2020-05-08  2:18:54 PM", "2020-05-08  2:18:54 PM", "2020-05-12  7:25:03 PM", 
"2020-05-12  7:25:03 PM", "2020-05-12  7:25:03 PM", "2020-05-12  7:25:03 PM"
)), class = "data.frame", row.names = c(NA, -11L))

Upvotes: 1

Related Questions