Reputation: 65
First of all, I'm somewhat new to R and I'm having trouble managing some time series data. I found a solution that works (code below), but it is awfully slow on larger datasets (35 minutes for one variable on 750k rows).
What I'm trying to achieve: every time the USAGE value is over a pre-defined limit (usage_limit), the counter resets to 0 and then counts up row by row until USAGE is over the limit again, when it resets once more. For each client the counter starts as NA and stays NA until USAGE first passes usage_limit, at which point it changes to 0. If an NA appears in USAGE after the counter has been set to 0, counting just continues normally. In simpler terms, I'm trying to create a variable which shows, per user, how many rows (months, in my case) ago USAGE was last over usage_limit.
Below are the dummy data with the expected output (USAGE_35PCT_MTH) and the loop used to calculate it. This was done on R 3.5.1, lubridate 1.7.4 and tidyverse 1.3.0.
library(lubridate)
library(tidyverse)
dummy_tb <- tibble(
  "USER_ID" = c("000001", "000001", "000001", "000001", "000001", "000001", "000001", "000001", "000001", "000001", "000001", "000001", "000001", "000001", "000001", "200000", "200000", "200000", "200000", "200000", "200000", "200000", "200000"),
  "REFERENCE_DATE" = c("31.01.2016", "29.02.2016", "31.03.2016", "30.04.2016", "31.05.2016", "30.06.2016", "31.07.2016", "31.08.2016", "30.09.2016", "31.10.2016", "30.11.2016", "31.12.2016", "31.01.2017", "28.02.2017", "31.03.2017", "31.03.2014", "30.04.2014", "31.05.2014", "30.06.2014", "31.07.2014", "31.08.2014", "30.09.2014", "31.10.2014"),
  "USAGE" = c(0.30, 0.35, 0.34, 0.38, 0.40, 0.70, 0.78, 0.95, 0.36, 0.22, 0.11, 0.01, 0.1, 0.1, 0.1, NA, 0.36, 0.2, NA, 0.2, 0.2, NA, 0.2),
  "USAGE_35PCT_MTH" = c(NA, 0, 1, 0, 0, 0, 0, 0, 0, 1, 2, 3, 4, 5, 6, NA, 0, 1, 2, 3, 4, 5, 6))

dummy_tb$REFERENCE_DATE <- as_datetime(dummy_tb$REFERENCE_DATE, format = "%d.%m.%Y")
dummy_tb$REFERENCE_DATE <- as_date(dummy_tb$REFERENCE_DATE)

dummy_tb <- dummy_tb %>%
  arrange(USER_ID, REFERENCE_DATE) %>%
  # Expected output column is blanked out here (numeric NA) and recomputed by the loop below
  mutate(USAGE_35PCT_MTH = NA_real_)
counter <- NA
user_curr <- ""
user_prev <- ""
usage_limit <- 0.35
for (row in 1:nrow(dummy_tb)) {
  user_curr <- dummy_tb[row, "USER_ID"]
  # New user: reset the counter to NA
  if (user_curr != user_prev) {
    counter <- NA
  }
  checking_value <- dummy_tb[row, "USAGE"]
  # Usage crossed the limit: restart the count at 0
  if (!is.na(checking_value)) {
    if (checking_value >= usage_limit) {
      counter <- 0
    }
  }
  dummy_tb[row, "USAGE_35PCT_MTH"] <- counter
  counter <- counter + 1
  user_prev <- user_curr
}
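Side note: most of the run time in the loop above seems to go into subsetting the tibble row by row. A minimal sketch of the same logic on pre-extracted vectors (purely illustrative, using the same dummy_tb and usage_limit) should already run far faster, though it is still a loop:

usage_vec  <- dummy_tb$USAGE
user_vec   <- dummy_tb$USER_ID
result_vec <- rep(NA_real_, nrow(dummy_tb))

counter   <- NA_real_
user_prev <- ""
for (row in seq_along(usage_vec)) {
  if (user_vec[row] != user_prev) {
    counter <- NA_real_          # new user: reset the counter
  }
  if (!is.na(usage_vec[row]) && usage_vec[row] >= usage_limit) {
    counter <- 0                 # usage crossed the limit: restart at 0
  }
  result_vec[row] <- counter
  counter <- counter + 1
  user_prev <- user_vec[row]
}
dummy_tb$USAGE_35PCT_MTH <- result_vec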
So my question is: is there a way to speed this up? I've been trying to figure out a way with dplyr, but haven't struck gold yet.
Thanks for the help!
Upvotes: 4
Views: 1341
Reputation: 65
I would just like to add an addendum which I didn't specify in the original question. While Ronak Shah's answer worked wonderfully for the initial problem, I had an issue when a USER_ID had nothing but NA values for USAGE throughout the data.frame: in Ronak's answer it would then count from 0 up to the number of rows that user had. I wanted NA values in that case, so I added a few lines to fulfill this requirement.
library(dplyr)

dummy_tb %>%
  # Flag NA usage per row before it gets replaced, so the all-NA check below still works
  mutate(USAGE_NA = is.na(USAGE)) %>%
  # Replace `NA` with 0
  mutate(USAGE = replace(USAGE, is.na(USAGE), 0)) %>%
  # Group by USER_ID
  group_by(USER_ID) %>%
  # Create a new group which resets every time USAGE reaches usage_limit
  group_by(temp = cumsum(USAGE >= usage_limit), add = TRUE) %>%
  # Create an index
  mutate(out = row_number() - 1) %>%
  group_by(USER_ID) %>%
  # Replace with NA the values before the first usage_limit cross
  mutate(out = replace(out, row_number() < which.max(USAGE >= usage_limit), NA)) %>%
  # Ungroup to reset grouping
  ungroup() %>%
  # Group by USER_ID again
  group_by(USER_ID) %>%
  # Check if all USAGE values were NA for the USER_ID
  mutate(out_temp = all(USAGE_NA)) %>%
  # Replace out with NA where out_temp == TRUE
  mutate(out = replace(out, out_temp, NA))
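To sanity-check the all-NA case, one can append a small made-up user whose USAGE is NA throughout (hypothetical data below) and re-run the chain; out should stay NA for every one of that user's rows:

# Hypothetical user whose USAGE is NA in every month
na_user_tb <- tibble(
  USER_ID        = rep("300000", 3),
  REFERENCE_DATE = as.Date(c("2015-01-31", "2015-02-28", "2015-03-31")),
  USAGE          = rep(NA_real_, 3)
)

# Bind onto the dummy data and re-run the pipeline above;
# user 300000 should get NA in `out` on all three rows
dummy_tb2 <- bind_rows(dummy_tb, na_user_tb)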
Edit: similarly, there was an issue if USAGE never crossed usage_limit at all. The months were then counted normally, when they should have been NA, since USAGE never crossed usage_limit. I added another check along the same lines as before: if all temp values for a USER_ID are 0, the group id never changed, which means USAGE never crossed usage_limit for that user.
At the end I added these lines:
  ungroup() %>%
  group_by(USER_ID) %>%
  # Check if USAGE never crossed usage_limit for the USER_ID
  mutate(out_temp = all(temp == 0)) %>%
  # Replace out with NA where out_temp == TRUE
  mutate(out = replace(out, out_temp, NA)) %>%
  ungroup()
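For completeness, the whole chain with both checks folded in might look like this (a sketch using the same columns and usage_limit as above, not a verbatim copy of what I ran):

library(dplyr)

dummy_tb %>%
  # Flag NA usage before it is replaced, so the all-NA check still works
  mutate(USAGE_NA = is.na(USAGE)) %>%
  mutate(USAGE = replace(USAGE, is.na(USAGE), 0)) %>%
  group_by(USER_ID) %>%
  # temp increments at every row where USAGE reaches usage_limit
  group_by(temp = cumsum(USAGE >= usage_limit), add = TRUE) %>%
  # Count rows within each stretch, starting at 0
  mutate(out = row_number() - 1) %>%
  group_by(USER_ID) %>%
  # NA before the first crossing
  mutate(out = replace(out, row_number() < which.max(USAGE >= usage_limit), NA)) %>%
  # NA for users whose USAGE is all NA or never reaches the limit
  mutate(out = replace(out, all(USAGE_NA) | all(temp == 0), NA)) %>%
  ungroup()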
Upvotes: 0
Reputation: 388862
Here's a way with dplyr:
library(dplyr)

dummy_tb %>%
  # Replace `NA` with 0
  mutate(USAGE = replace(USAGE, is.na(USAGE), 0)) %>%
  # Group by USER_ID
  group_by(USER_ID) %>%
  # Create a new group which resets every time USAGE reaches usage_limit
  group_by(temp = cumsum(USAGE >= usage_limit), add = TRUE) %>%
  # Create an index
  mutate(out = row_number() - 1) %>%
  group_by(USER_ID) %>%
  # Replace with NA the values before the first usage_limit cross
  mutate(out = replace(out, row_number() < which.max(USAGE >= usage_limit), NA))
which returns:
# USER_ID REFERENCE_DATE USAGE USAGE_35PCT_MTH temp out
#1 000001 31.01.2016 0.30 NA 0 NA
#2 000001 29.02.2016 0.35 0 1 0
#3 000001 31.03.2016 0.34 1 1 1
#4 000001 30.04.2016 0.38 0 2 0
#5 000001 31.05.2016 0.40 0 3 0
#6 000001 30.06.2016 0.70 0 4 0
#7 000001 31.07.2016 0.78 0 5 0
#8 000001 31.08.2016 0.95 0 6 0
#9 000001 30.09.2016 0.36 0 7 0
#10 000001 31.10.2016 0.22 1 7 1
#11 000001 30.11.2016 0.11 2 7 2
#12 000001 31.12.2016 0.01 3 7 3
#13 000001 31.01.2017 0.10 4 7 4
#14 000001 28.02.2017 0.10 5 7 5
#15 000001 31.03.2017 0.10 6 7 6
#16 200000 31.03.2014 0.00 NA 0 NA
#17 200000 30.04.2014 0.36 0 1 0
#18 200000 31.05.2014 0.20 1 1 1
#19 200000 30.06.2014 0.00 2 1 2
#20 200000 31.07.2014 0.20 3 1 3
#21 200000 31.08.2014 0.20 4 1 4
#22 200000 30.09.2014 0.00 5 1 5
#23 200000 31.10.2014 0.20 6 1 6
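The core trick is the cumsum(USAGE >= usage_limit) grouping: the running sum goes up by one at every row that reaches the limit, so all rows up to the next crossing share a group id, and row_number() - 1 simply counts how many rows have passed since that crossing. A minimal illustration of the grouping idea:

usage <- c(0.30, 0.35, 0.34, 0.38, 0.40)
cumsum(usage >= 0.35)
#> [1] 0 1 1 2 3

which matches the first few values of temp in the output above.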
Upvotes: 2