Reputation: 65
First of all, I'm somewhat new to R and I'm having trouble managing some time series data. I found a solution that works (code below), but it is awfully slow on larger datasets (35 minutes for one variable on 750k rows).
What I'm trying to achieve: every time the USAGE value is over a pre-defined limit (usage_limit), the counter resets to 0 and then counts up row by row until USAGE is over the limit again, when it resets once more. For each client the counter starts as NA and stays NA until USAGE first passes usage_limit, at which point it changes to 0. If an NA appears in USAGE after the counter has been set to 0, counting just continues normally. In simpler terms, I'm trying to create a variable which shows, per user, how many rows (months, in my case) ago USAGE was last over usage_limit.
Below are the dummy data with the expected output (USAGE_35PCT_MTH) and the loop used to calculate it. This was done on R 3.5.1, lubridate 1.7.4 and tidyverse 1.3.0.
library(lubridate)
library(tidyverse)
dummy_tb <- tibble(
  "USER_ID" = c("000001", "000001", "000001", "000001", "000001", "000001", "000001", "000001", "000001", "000001", "000001", "000001", "000001", "000001", "000001", "200000", "200000", "200000", "200000", "200000", "200000", "200000", "200000"),
  "REFERENCE_DATE" = c("31.01.2016", "29.02.2016", "31.03.2016", "30.04.2016", "31.05.2016", "30.06.2016", "31.07.2016", "31.08.2016", "30.09.2016", "31.10.2016", "30.11.2016", "31.12.2016", "31.01.2017", "28.02.2017", "31.03.2017", "31.03.2014", "30.04.2014", "31.05.2014", "30.06.2014", "31.07.2014", "31.08.2014", "30.09.2014", "31.10.2014"),
  "USAGE" = c(0.30, 0.35, 0.34, 0.38, 0.40, 0.70, 0.78, 0.95, 0.36, 0.22, 0.11, 0.01, 0.1, 0.1, 0.1, NA, 0.36, 0.2, NA, 0.2, 0.2, NA, 0.2),
  "USAGE_35PCT_MTH" = c(NA, 0, 1, 0, 0, 0, 0, 0, 0, 1, 2, 3, 4, 5, 6, NA, 0, 1, 2, 3, 4, 5, 6))

dummy_tb$REFERENCE_DATE <- as_datetime(dummy_tb$REFERENCE_DATE, format = "%d.%m.%Y")
dummy_tb$REFERENCE_DATE <- as_date(dummy_tb$REFERENCE_DATE)

dummy_tb <- dummy_tb %>%
  arrange(USER_ID, REFERENCE_DATE) %>%
  # Expected output column is blanked out here (numeric NA) and recomputed by the loop below
  mutate(USAGE_35PCT_MTH = NA_real_)
counter <- NA
user_curr <- ""
user_prev <- ""
usage_limit <- 0.35
for (row in 1:nrow(dummy_tb)) {
  user_curr <- dummy_tb[row, "USER_ID"]
  # New user: reset the counter to NA
  if (user_curr != user_prev) {
    counter <- NA
  }
  checking_value <- dummy_tb[row, "USAGE"]
  # Usage crossed the limit: restart the count at 0
  if (!is.na(checking_value)) {
    if (checking_value >= usage_limit) {
      counter <- 0
    }
  }
  dummy_tb[row, "USAGE_35PCT_MTH"] <- counter
  counter <- counter + 1
  user_prev <- user_curr
}
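Side note: most of the run time in the loop above seems to go into subsetting the tibble row by row. A minimal sketch of the same logic on pre-extracted vectors (purely illustrative, using the same dummy_tb and usage_limit) should already run far faster, though it is still a loop:

usage_vec  <- dummy_tb$USAGE
user_vec   <- dummy_tb$USER_ID
result_vec <- rep(NA_real_, nrow(dummy_tb))

counter   <- NA_real_
user_prev <- ""
for (row in seq_along(usage_vec)) {
  if (user_vec[row] != user_prev) {
    counter <- NA_real_          # new user: reset the counter
  }
  if (!is.na(usage_vec[row]) && usage_vec[row] >= usage_limit) {
    counter <- 0                 # usage crossed the limit: restart at 0
  }
  result_vec[row] <- counter
  counter <- counter + 1
  user_prev <- user_vec[row]
}
dummy_tb$USAGE_35PCT_MTH <- result_vec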
So my question is: is there a way to speed this up? I've been trying to figure out a way with dplyr, but haven't struck gold yet.
Thanks for the help!
Upvotes: 4
Views: 1341
Reputation: 65
I would just like to add an addendum which I didn't specify in the original question. While Ronak Shah's answer worked wonderfully for the initial problem, I had an issue when a USER_ID had nothing but NA values for USAGE throughout the data.frame: in Ronak's answer it would then count from 0 up to the number of rows that user had. I wanted NA values in that case, so I added a few lines to fulfill this requirement.
library(dplyr)

dummy_tb %>%
  # Flag NA usage per row before it gets replaced, so the all-NA check below still works
  mutate(USAGE_NA = is.na(USAGE)) %>%
  # Replace `NA` with 0
  mutate(USAGE = replace(USAGE, is.na(USAGE), 0)) %>%
  # Group by USER_ID
  group_by(USER_ID) %>%
  # Create a new group which resets every time USAGE reaches usage_limit
  group_by(temp = cumsum(USAGE >= usage_limit), add = TRUE) %>%
  # Create an index
  mutate(out = row_number() - 1) %>%
  group_by(USER_ID) %>%
  # Replace with NA the values before the first usage_limit cross
  mutate(out = replace(out, row_number() < which.max(USAGE >= usage_limit), NA)) %>%
  # Ungroup to reset grouping
  ungroup() %>%
  # Group by USER_ID again
  group_by(USER_ID) %>%
  # Check if all USAGE values were NA for the USER_ID
  mutate(out_temp = all(USAGE_NA)) %>%
  # Replace out with NA where out_temp == TRUE
  mutate(out = replace(out, out_temp, NA))
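To sanity-check the all-NA case, one can append a small made-up user whose USAGE is NA throughout (hypothetical data below) and re-run the chain; out should stay NA for every one of that user's rows:

# Hypothetical user whose USAGE is NA in every month
na_user_tb <- tibble(
  USER_ID        = rep("300000", 3),
  REFERENCE_DATE = as.Date(c("2015-01-31", "2015-02-28", "2015-03-31")),
  USAGE          = rep(NA_real_, 3)
)

# Bind onto the dummy data and re-run the pipeline above;
# user 300000 should get NA in `out` on all three rows
dummy_tb2 <- bind_rows(dummy_tb, na_user_tb)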
Edit: similarly, there was an issue if USAGE never crossed usage_limit at all. The months were then counted normally, when they should have been NA, since USAGE never crossed usage_limit. I added another check along the same lines as before: if all temp values for a USER_ID are 0, the group id never changed, which means USAGE never crossed usage_limit for that user.
At the end I added these lines:
  ungroup() %>%
  group_by(USER_ID) %>%
  # Check if USAGE never crossed usage_limit for the USER_ID
  mutate(out_temp = all(temp == 0)) %>%
  # Replace out with NA where out_temp == TRUE
  mutate(out = replace(out, out_temp, NA)) %>%
  ungroup()
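For completeness, the whole chain with both checks folded in might look like this (a sketch using the same columns and usage_limit as above, not a verbatim copy of what I ran):

library(dplyr)

dummy_tb %>%
  # Flag NA usage before it is replaced, so the all-NA check still works
  mutate(USAGE_NA = is.na(USAGE)) %>%
  mutate(USAGE = replace(USAGE, is.na(USAGE), 0)) %>%
  group_by(USER_ID) %>%
  # temp increments at every row where USAGE reaches usage_limit
  group_by(temp = cumsum(USAGE >= usage_limit), add = TRUE) %>%
  # Count rows within each stretch, starting at 0
  mutate(out = row_number() - 1) %>%
  group_by(USER_ID) %>%
  # NA before the first crossing
  mutate(out = replace(out, row_number() < which.max(USAGE >= usage_limit), NA)) %>%
  # NA for users whose USAGE is all NA or never reaches the limit
  mutate(out = replace(out, all(USAGE_NA) | all(temp == 0), NA)) %>%
  ungroup()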
Upvotes: 0
Reputation: 388862
Here's a way with dplyr:
library(dplyr)

dummy_tb %>%
  # Replace `NA` with 0
  mutate(USAGE = replace(USAGE, is.na(USAGE), 0)) %>%
  # Group by USER_ID
  group_by(USER_ID) %>%
  # Create a new group which resets every time USAGE reaches usage_limit
  group_by(temp = cumsum(USAGE >= usage_limit), add = TRUE) %>%
  # Create an index
  mutate(out = row_number() - 1) %>%
  group_by(USER_ID) %>%
  # Replace with NA the values before the first usage_limit cross
  mutate(out = replace(out, row_number() < which.max(USAGE >= usage_limit), NA))
which returns:
# USER_ID REFERENCE_DATE USAGE USAGE_35PCT_MTH temp out
#1 000001 31.01.2016 0.30 NA 0 NA
#2 000001 29.02.2016 0.35 0 1 0
#3 000001 31.03.2016 0.34 1 1 1
#4 000001 30.04.2016 0.38 0 2 0
#5 000001 31.05.2016 0.40 0 3 0
#6 000001 30.06.2016 0.70 0 4 0
#7 000001 31.07.2016 0.78 0 5 0
#8 000001 31.08.2016 0.95 0 6 0
#9 000001 30.09.2016 0.36 0 7 0
#10 000001 31.10.2016 0.22 1 7 1
#11 000001 30.11.2016 0.11 2 7 2
#12 000001 31.12.2016 0.01 3 7 3
#13 000001 31.01.2017 0.10 4 7 4
#14 000001 28.02.2017 0.10 5 7 5
#15 000001 31.03.2017 0.10 6 7 6
#16 200000 31.03.2014 0.00 NA 0 NA
#17 200000 30.04.2014 0.36 0 1 0
#18 200000 31.05.2014 0.20 1 1 1
#19 200000 30.06.2014 0.00 2 1 2
#20 200000 31.07.2014 0.20 3 1 3
#21 200000 31.08.2014 0.20 4 1 4
#22 200000 30.09.2014 0.00 5 1 5
#23 200000 31.10.2014 0.20 6 1 6
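The core trick is the cumsum(USAGE >= usage_limit) grouping: the running sum goes up by one at every row that reaches the limit, so all rows up to the next crossing share a group id, and row_number() - 1 simply counts how many rows have passed since that crossing. A minimal illustration of the grouping idea:

usage <- c(0.30, 0.35, 0.34, 0.38, 0.40)
cumsum(usage >= 0.35)
#> [1] 0 1 1 2 3

which matches the first few values of temp in the output above.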
Upvotes: 2