rfortin
rfortin

Reputation: 194

How can I get average time per event from a data frame?

So I basically need to find the average time per stage from the following data frame. I'm pretty competent in R but am unsure how to complete this task. Would dplyr be the best method?

    Client   Stage Stage.Start
1 Client A Stage 1  2017/01/01
2 Client B Stage 1  2017/03/04
3 Client C Stage 2  2017/03/10
4 Client A Stage 2  2017/02/03
5 Client A Stage 3  2017/06/01
6 Client C Stage 3  2017/09/09

Expected output:

    Client   Stage Stage.Start Stage.Duration
1 Client A Stage 1  2017/01/01 31 days
2 Client B Stage 1  2017/03/04 NA
3 Client C Stage 2  2017/03/10 180 days
4 Client A Stage 2  2017/02/03 118 days
5 Client A Stage 3  2017/06/01 NA
6 Client C Stage 3  2017/09/09 NA

Roll-up:

Stage    Avg.Duration
Stage 1  31 days
Stage 2  149 days
Stage 3  NA

Upvotes: 0

Views: 112

Answers (2)

amanda
amanda

Reputation: 331

If I understand you correctly, you can use dplyr::group_by() and summarise() for this problem.

df <- tribble(
  ~Client,  ~Stage, ~Stage.Start,
   "Client A", "Stage 1",  "2017/01/01",
   "Client B", "Stage 1",  "2017/03/04",
   "Client C", "Stage 2",  "2017/03/10",
   "Client A", "Stage 2",  "2017/02/03",
   "Client A", "Stage 3",  "2017/06/01",
   "Client C", "Stage 4",  "2017/09/09"
)

df$Client <- factor(df$Client)
df$Stage <- factor(df$Stage)
df$Stage.Start <- lubridate::ymd(as.Date(df$Stage.Start))

lags <- df %>% group_by(Client) %>% 
  mutate(
    lag_time = lag(Stage.Start),
    time_diff = Stage.Start - lag_time
  ) 

mean_by_stage <- lags %>% 
  group_by(Stage) %>% 
  summarise(
    mean_diff = mean(time_diff, na.rm = TRUE)
  )
Edit -- a look at the output:
lags
# A tibble: 6 x 5
# Groups:   Client [3]
    Client   Stage Stage.Start   lag_time time_diff
    <fctr>  <fctr>      <date>     <date>    <time>
1 Client A Stage 1  2017-01-01         NA   NA days
2 Client B Stage 1  2017-03-04         NA   NA days
3 Client C Stage 2  2017-03-10         NA   NA days
4 Client A Stage 2  2017-02-03 2017-01-01   33 days
5 Client A Stage 3  2017-06-01 2017-02-03  118 days
6 Client C Stage 4  2017-09-09 2017-03-10  183 days


mean_by_stage   
# A tibble: 4 x 2
    Stage mean_diff
   <fctr>    <time>
1 Stage 1  NaN days
2 Stage 2   33 days
3 Stage 3  118 days
4 Stage 4  183 days

Upvotes: 3

Uwe
Uwe

Reputation: 42572

If I understand the question correctly, the code below should give the desired answer:

library(data.table)
setorder(DT, Client, Stage)
DT[, duration := shift(Stage.Start, type = "lead", fill = Sys.Date()) - Stage.Start, 
   by = Client][, .(avg.duration = mean(duration)), by = Stage]
     Stage avg.duration
1: Stage 1        137.5
2: Stage 2        150.5
3: Stage 3        153.0
4: Stage 4         53.0

Note that I have used the actual date for stages which are not completed yet to avoid NA and to get a duration to date.

Alternatively,

DT[, duration := shift(Stage.Start, type = "lead") - Stage.Start, by = Client][
  , .(avg.duration = mean(duration, na.rm = TRUE)), by = Stage]

will return the expected result (except for slight variations in the figures)

     Stage avg.duration
1: Stage 1         33.0
2: Stage 2        150.5
3: Stage 3          NaN
4: Stage 4          NaN

Explanation

OP's intention might become clearer if the data frame is properly ordered by Client and Stage:

setorder(DT, Client, Stage)
DT
   id   Client   Stage Stage.Start
1:  1 Client A Stage 1  2017-01-01
2:  4 Client A Stage 2  2017-02-03
3:  5 Client A Stage 3  2017-06-01
4:  2 Client B Stage 1  2017-03-04
5:  3 Client C Stage 2  2017-03-10
6:  6 Client C Stage 4  2017-09-09

Then the durations are calculated for each client, either using the actual date (for a duration to date):

DT[, duration := shift(Stage.Start, type = "lead", fill = Sys.Date()) - Stage.Start, 
   by = Client][]
   id   Client   Stage Stage.Start duration
1:  1 Client A Stage 1  2017-01-01       33
2:  4 Client A Stage 2  2017-02-03      118
3:  5 Client A Stage 3  2017-06-01      153
4:  2 Client B Stage 1  2017-03-04      242
5:  3 Client C Stage 2  2017-03-10      183
6:  6 Client C Stage 4  2017-09-09       53

or NA by default:

DT[, duration := shift(Stage.Start, type = "lead") - Stage.Start, by = Client][]
   id   Client   Stage Stage.Start duration
1:  1 Client A Stage 1  2017-01-01       33
2:  4 Client A Stage 2  2017-02-03      118
3:  5 Client A Stage 3  2017-06-01       NA
4:  2 Client B Stage 1  2017-03-04       NA
5:  3 Client C Stage 2  2017-03-10      183
6:  6 Client C Stage 4  2017-09-09       NA

Data

library(data.table)
DT <- fread("id,   Client,   Stage, Stage.Start
                  1, Client A, Stage 1,  2017/01/01
                  2, Client B, Stage 1,  2017/03/04
                  3, Client C, Stage 2,  2017/03/10
                  4, Client A, Stage 2,  2017/02/03
                  5, Client A, Stage 3,  2017/06/01
                  6, Client C, Stage 4,  2017/09/09")
DT[, Stage.Start := as.IDate(Stage.Start, "%Y/%m/%d")]

Upvotes: 3

Related Questions