Lynn
Lynn

Reputation: 4388

How to group time instances, find its duration and sum them based on specific conditions (in R or Python)

Objective

I have a dataset that tracks when someone was editing a message, left, and came back to resume editing that same message. I am trying to group this one message together and find its total duration.

Problem

I need to first group the message together if the following conditions apply: if the Folder == 'out' OR drafts, the Message == "", Edit == "T", and if the contents in the Subject and Re columns are consecutively the same.

Ideally, this would give me group A along with its duration. For example, this first 'block' of data would be labeled 'Group A' , would have a start time of 1/2/2020 1:00:01 AM and an end time of 1/2/2020 1:00:30 AM

Furthermore, I would like to 'match' group A with another 'block' of data if the last row of the Subject, Re and Length column matches its first row. So the second Group A would have a start time of 1/2/2020 1:02:00 AM and an end time of 1/2/2020 1:02:05 AM.

Subject Re                    Length         Folder      Message   Date                   Edit     
        [email protected],[email protected] 80             out                   1/2/2020 1:00:01 AM     T                               
        [email protected],[email protected] 80             out                   1/2/2020 1:00:05 AM     T                        
hey     [email protected],[email protected] 80             out                   1/2/2020 1:00:10 AM     T                        
hey     [email protected],[email protected] 80             out                   1/2/2020 1:00:15 AM     T                        
hey     [email protected],[email protected] 80             out                   1/2/2020 1:00:30 AM     T 



hey     [email protected],[email protected] 80            draft                  1/2/2020 1:02:00 AM     T                        
hey     [email protected],[email protected] 80            draft                  1/2/2020 1:02:05 AM     T                        




hey     [email protected],[email protected] 80             out                   1/2/2020 1:03:10 AM     T                        
hey     [email protected],[email protected] 80             out                   1/2/2020 1:03:20 AM     T                        

Desired Output

 Start                  End                        Duration          Group
 1/2/2020 1:00:10 AM    1/2/2020 1:00:30 AM        20                A
 1/2/2020 1:02:00 AM    1/2/2020 1:02:05 AM        5                 A
 1/2/2020 1:03:10 AM    1/2/2020 1:03:20 AM        10                A

dput:

 structure(list(Subject = structure(c(1L, 1L, 2L, 2L, 2L, 1L, 
 1L, 2L, 2L, 1L, 1L, 2L, 2L), .Label = c("", "hey"), class = "factor"), 
 Recipient = structure(c(1L, 1L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 
 1L, 1L, 2L, 2L), .Label = c("", "[email protected],[email protected]"
 ), class = "factor"), Length = c(80L, 80L, 80L, 80L, 80L, 
 NA, NA, 80L, 80L, NA, NA, 100L, 100L), Folder = structure(c(3L, 
 3L, 3L, 3L, 3L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 2L), .Label = c("", 
 "draft", "out"), class = "factor"), Message = c(NA, NA, NA, 
 NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), Date = structure(c(2L, 
 3L, 4L, 5L, 6L, 1L, 1L, 7L, 8L, 1L, 1L, 9L, 10L), .Label = c("", 
 "1/2/2020 1:00:01 AM", "1/2/2020 1:00:05 AM", "1/2/2020 1:00:10 AM", 
 "1/2/2020 1:00:15 AM", "1/2/2020 1:00:30 AM", "1/2/2020 1:02:00 AM", 
 "1/2/2020 1:02:05 AM", "1/2/2020 1:03:00 AM", "1/2/2020 1:03:20 AM"
 ), class = "factor"), Edit = c(TRUE, TRUE, TRUE, TRUE, TRUE, 
 NA, NA, TRUE, TRUE, NA, NA, TRUE, TRUE)), class = "data.frame", row.names = c(NA, 
-13L))

I am thinking I may have to do an if/else statement or some sort of filter. Any help/suggestion or advice would be greatly appreciated.

  library(tidyverse)
  library(lubridate)



  df1<-df2 %>%
  mutate(Date = lubridate::mdy_hms(Date), 
  cond = Edit == "T" & ItemFolderName == "out" | Folder == "drafts" &     Message == "" & Subject ==  ?   & Re ==   ?     & Length == ?   , 
     grp = cumsum(!cond)) %>%
  filter(cond) %>%
  group_by(grp) %>%

Upvotes: 0

Views: 178

Answers (1)

astrofunkswag
astrofunkswag

Reputation: 2688

The solution is going to be something like this. There are some differences between the data you provided and your desired solution, the column types are a little different (like Edit is provided as a logical). Also, the rows are a little different so this won't be the same as your desired output (you provide a length of 100 in the data, not in your example)

I used a function from this post to assign Group labels. It gives a number you could modify to make it a letter if you want, but you need to handle what would happen if you have over 26 groups.

df <- df %>% 
  mutate(Message = ifelse(is.na(Message), '', Message)) %>% 
  mutate(
    cond = Edit & Folder %in% c('out', 'draft') & Message == ''
  ) %>% 
  mutate(grp = cumsum(!cond)) %>% 
  filter(cond, Subject != '')

df$Date <- as.POSIXct(as.character(df$Date), format = '%m/%d/%Y %H:%M:%OS')

get_group_number = function(){
  i = 0
  function(){
    i <<- i+1
    i
  }
}
group_number = get_group_number()
df <- df %>% group_by(Subject, Recipient, Length) %>% mutate(Group = group_number())

df %>% group_by(grp) %>% 
   summarise(Start = min(Date), End = max(Date),
             Duration = End - Start, Group = unique(Group)[1])
# A tibble: 3 x 5
    grp Start               End                 Duration Group
  <int> <dttm>              <dttm>              <drtn>   <dbl>
1     0 2020-01-02 01:00:10 2020-01-02 01:00:30 20 secs      1
2     2 2020-01-02 01:02:00 2020-01-02 01:02:05  5 secs      1
3     4 2020-01-02 01:03:00 2020-01-02 01:03:20 20 secs      2

I think you might also want to consider how you define groups, since it seems like you could have the same Subject, Recipient and Length by chance alone

Upvotes: 1

Related Questions