MLE
MLE

Reputation: 195

get row number of group but with condition in R

I have a dataframe that has been sorted by userid and date as below:

df <- data.frame(
  userid = c("1", "1", "1", "2", "2", "3", "3"),
  date = c("2016-10-23", "2018-01-01", "2020-03-03", "2008-09-04", "2021-08-08", "2019-10-02", "2022-01-02")
) 

 userid       date
      1 2016-10-23
      1 2018-01-01
      1 2020-03-03
      2 2008-09-04
      2 2021-08-08
      3 2019-10-02
      3 2022-01-02

I want to group by userid and assign row numbers, BUT anything before 2017 should be row number 0:

df2 <- data.frame(
  userid = c("1", "1", "1", "2", "2", "3", "3"),
  date = c("2016-10-23", "2018-01-01", "2020-03-03", "2008-09-04", "2021-08-08", "2019-10-02", "2022-01-02"),
  rownumber = c("0", "1", "2", "0", "1", "1", "2")
)

 userid       date rownumber
      1 2016-10-23         0
      1 2018-01-01         1
      1 2020-03-03         2
      2 2008-09-04         0
      2 2021-08-08         1
      3 2019-10-02         1
      3 2022-01-02         2

As you can see, the dates with years 2016 and 2008 have been assigned as 0 and the rownumber increments from 1 excluding those.

I have tried the below, but it does not assign 0 to dates before 2017.

df %>%
  group_by(userid) %>%
  mutate(rownumber = row_number()) %>%
  ungroup()

Thanks!

Upvotes: 2

Views: 1073

Answers (2)

Ronak Shah
Ronak Shah

Reputation: 389265

You can extract the year from the date. For each userid subtract the row number by number of values that are <= 2017.

library(dplyr)

df %>%
  mutate(date = as.Date(date), 
         year = lubridate::year(date)) %>%
  arrange(userid, date) %>%
  group_by(userid) %>%
  mutate(rownumber = pmax(row_number() - sum(year <= 2017), 0)) %>%
  ungroup

#  userid date        year rownumber
#  <chr>  <date>     <dbl>     <int>
#1 1      2016-10-23  2016         0
#2 1      2018-01-01  2018         1
#3 1      2020-03-03  2020         2
#4 2      2008-09-04  2008         0
#5 2      2021-08-08  2021         1
#6 3      2019-10-02  2019         1
#7 3      2022-01-02  2022         2

Upvotes: 2

r2evans
r2evans

Reputation: 160952

df %>%
  mutate(date = as.Date(date)) %>%
  group_by(userid, year = (1900+as.POSIXlt(date)$year) < 2018) %>%
  mutate(rownumber = if (year[1]) 0 else order(date)) %>%
  ungroup() %>%
  select(-year)
# # A tibble: 7 x 3
#   userid date       rownumber
#   <chr>  <date>         <dbl>
# 1 1      2016-10-23         0
# 2 1      2018-01-01         1
# 3 1      2020-03-03         2
# 4 2      2008-09-04         0
# 5 2      2021-08-08         1
# 6 3      2019-10-02         1
# 7 3      2022-01-02         2

I could have used row_number() in place of order(date), but I didn't know for certain if your data was known to be pre-sorted. You can use row_number() for the same results here.

(FYI, your date is a string, so I also converted it to a proper Date class, for better or worse. If you don't want to do that, you can replace the first mutate and the (1900+...)<2018 with as.integer(substring(date,1,4)) < 2018 for the same effect.)

Upvotes: 0

Related Questions