Reputation: 194
So I basically need to find the average time per stage from the following data frame. I'm pretty competent in R but am unsure how to complete this task. Would dplyr
be the best method?
Client Stage Stage.Start
1 Client A Stage 1 2017/01/01
2 Client B Stage 1 2017/03/04
3 Client C Stage 2 2017/03/10
4 Client A Stage 2 2017/02/03
5 Client A Stage 3 2017/06/01
6 Client C Stage 3 2017/09/09
Expected output:
Client Stage Stage.Start Stage.Duration
1 Client A Stage 1 2017/01/01 31 days
2 Client B Stage 1 2017/03/04 NA
3 Client C Stage 2 2017/03/10 180 days
4 Client A Stage 2 2017/02/03 118 days
5 Client A Stage 3 2017/06/01 NA
6 Client C Stage 3 2017/09/09 NA
Roll-up:
Stage Avg.Duration
Stage 1 31 days
Stage 2 149 days
Stage 3 NA
Upvotes: 0
Views: 112
Reputation: 331
If I understand you correctly, you can use dplyr::group_by()
and summarise()
for this problem.
df <- tribble(
~Client, ~Stage, ~Stage.Start,
"Client A", "Stage 1", "2017/01/01",
"Client B", "Stage 1", "2017/03/04",
"Client C", "Stage 2", "2017/03/10",
"Client A", "Stage 2", "2017/02/03",
"Client A", "Stage 3", "2017/06/01",
"Client C", "Stage 4", "2017/09/09"
)
df$Client <- factor(df$Client)
df$Stage <- factor(df$Stage)
df$Stage.Start <- lubridate::ymd(as.Date(df$Stage.Start))
lags <- df %>% group_by(Client) %>%
mutate(
lag_time = lag(Stage.Start),
time_diff = Stage.Start - lag_time
)
mean_by_stage <- lags %>%
group_by(Stage) %>%
summarise(
mean_diff = mean(time_diff, na.rm = TRUE)
)
Edit -- a look at the output:
lags
# A tibble: 6 x 5
# Groups: Client [3]
Client Stage Stage.Start lag_time time_diff
<fctr> <fctr> <date> <date> <time>
1 Client A Stage 1 2017-01-01 NA NA days
2 Client B Stage 1 2017-03-04 NA NA days
3 Client C Stage 2 2017-03-10 NA NA days
4 Client A Stage 2 2017-02-03 2017-01-01 33 days
5 Client A Stage 3 2017-06-01 2017-02-03 118 days
6 Client C Stage 4 2017-09-09 2017-03-10 183 days
mean_by_stage
# A tibble: 4 x 2
Stage mean_diff
<fctr> <time>
1 Stage 1 NaN days
2 Stage 2 33 days
3 Stage 3 118 days
4 Stage 4 183 days
Upvotes: 3
Reputation: 42572
If I understand the question correctly, the code below should give the desired answer:
library(data.table)
setorder(DT, Client, Stage)
DT[, duration := shift(Stage.Start, type = "lead", fill = Sys.Date()) - Stage.Start,
by = Client][, .(avg.duration = mean(duration)), by = Stage]
Stage avg.duration 1: Stage 1 137.5 2: Stage 2 150.5 3: Stage 3 153.0 4: Stage 4 53.0
Note that I have used the actual date for stages which are not completed yet to avoid NA
and to get a duration to date.
Alternatively,
DT[, duration := shift(Stage.Start, type = "lead") - Stage.Start, by = Client][
, .(avg.duration = mean(duration, na.rm = TRUE)), by = Stage]
will return the expected result (except for slight variations in the figures)
Stage avg.duration 1: Stage 1 33.0 2: Stage 2 150.5 3: Stage 3 NaN 4: Stage 4 NaN
OP's intention might become clearer if the data frame is properly ordered by Client
and Stage
:
setorder(DT, Client, Stage)
DT
id Client Stage Stage.Start 1: 1 Client A Stage 1 2017-01-01 2: 4 Client A Stage 2 2017-02-03 3: 5 Client A Stage 3 2017-06-01 4: 2 Client B Stage 1 2017-03-04 5: 3 Client C Stage 2 2017-03-10 6: 6 Client C Stage 4 2017-09-09
Then the durations are calculated for each client, either using the actual date (for a duration to date):
DT[, duration := shift(Stage.Start, type = "lead", fill = Sys.Date()) - Stage.Start,
by = Client][]
id Client Stage Stage.Start duration 1: 1 Client A Stage 1 2017-01-01 33 2: 4 Client A Stage 2 2017-02-03 118 3: 5 Client A Stage 3 2017-06-01 153 4: 2 Client B Stage 1 2017-03-04 242 5: 3 Client C Stage 2 2017-03-10 183 6: 6 Client C Stage 4 2017-09-09 53
or NA
by default:
DT[, duration := shift(Stage.Start, type = "lead") - Stage.Start, by = Client][]
id Client Stage Stage.Start duration 1: 1 Client A Stage 1 2017-01-01 33 2: 4 Client A Stage 2 2017-02-03 118 3: 5 Client A Stage 3 2017-06-01 NA 4: 2 Client B Stage 1 2017-03-04 NA 5: 3 Client C Stage 2 2017-03-10 183 6: 6 Client C Stage 4 2017-09-09 NA
library(data.table)
DT <- fread("id, Client, Stage, Stage.Start
1, Client A, Stage 1, 2017/01/01
2, Client B, Stage 1, 2017/03/04
3, Client C, Stage 2, 2017/03/10
4, Client A, Stage 2, 2017/02/03
5, Client A, Stage 3, 2017/06/01
6, Client C, Stage 4, 2017/09/09")
DT[, Stage.Start := as.IDate(Stage.Start, "%Y/%m/%d")]
Upvotes: 3