Reputation: 4388
Objective
I have a dataset that tracks when someone was editing a message, left, and came back to resume editing that same message. I am trying to group this one message together and find its total duration.
Problem
I need to first group the message together if the following conditions apply:
if the Folder == 'out'
OR drafts
, the Message == ""
, Edit == "T"
, and if the contents in the
Subject and Re columns are consecutively the same.
Ideally, this would give me group A along with its duration. For example, this first 'block' of data would be labeled 'Group A' , would have a start time of 1/2/2020 1:00:01 AM and an end time of 1/2/2020 1:00:30 AM
Furthermore, I would like to 'match' group A with another 'block' of data if the last row of the Subject, Re and Length column matches its first row. So the second Group A would have a start time of 1/2/2020 1:02:00 AM and an end time of 1/2/2020 1:02:05 AM.
Subject Re Length Folder Message Date Edit
[email protected],[email protected] 80 out 1/2/2020 1:00:01 AM T
[email protected],[email protected] 80 out 1/2/2020 1:00:05 AM T
hey [email protected],[email protected] 80 out 1/2/2020 1:00:10 AM T
hey [email protected],[email protected] 80 out 1/2/2020 1:00:15 AM T
hey [email protected],[email protected] 80 out 1/2/2020 1:00:30 AM T
hey [email protected],[email protected] 80 draft 1/2/2020 1:02:00 AM T
hey [email protected],[email protected] 80 draft 1/2/2020 1:02:05 AM T
hey [email protected],[email protected] 80 out 1/2/2020 1:03:10 AM T
hey [email protected],[email protected] 80 out 1/2/2020 1:03:20 AM T
Desired Output
Start End Duration Group
1/2/2020 1:00:10 AM 1/2/2020 1:00:30 AM 20 A
1/2/2020 1:02:00 AM 1/2/2020 1:02:05 AM 5 A
1/2/2020 1:03:10 AM 1/2/2020 1:03:20 AM 10 A
dput:
structure(list(Subject = structure(c(1L, 1L, 2L, 2L, 2L, 1L,
1L, 2L, 2L, 1L, 1L, 2L, 2L), .Label = c("", "hey"), class = "factor"),
Recipient = structure(c(1L, 1L, 2L, 2L, 2L, 1L, 1L, 2L, 2L,
1L, 1L, 2L, 2L), .Label = c("", "[email protected],[email protected]"
), class = "factor"), Length = c(80L, 80L, 80L, 80L, 80L,
NA, NA, 80L, 80L, NA, NA, 100L, 100L), Folder = structure(c(3L,
3L, 3L, 3L, 3L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 2L), .Label = c("",
"draft", "out"), class = "factor"), Message = c(NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), Date = structure(c(2L,
3L, 4L, 5L, 6L, 1L, 1L, 7L, 8L, 1L, 1L, 9L, 10L), .Label = c("",
"1/2/2020 1:00:01 AM", "1/2/2020 1:00:05 AM", "1/2/2020 1:00:10 AM",
"1/2/2020 1:00:15 AM", "1/2/2020 1:00:30 AM", "1/2/2020 1:02:00 AM",
"1/2/2020 1:02:05 AM", "1/2/2020 1:03:00 AM", "1/2/2020 1:03:20 AM"
), class = "factor"), Edit = c(TRUE, TRUE, TRUE, TRUE, TRUE,
NA, NA, TRUE, TRUE, NA, NA, TRUE, TRUE)), class = "data.frame", row.names = c(NA,
-13L))
I am thinking I may have to do an if/else statement or some sort of filter. Any help/suggestion or advice would be greatly appreciated.
library(tidyverse)
library(lubridate)
df1<-df2 %>%
mutate(Date = lubridate::mdy_hms(Date),
cond = Edit == "T" & ItemFolderName == "out" | Folder == "drafts" & Message == "" & Subject == ? & Re == ? & Length == ? ,
grp = cumsum(!cond)) %>%
filter(cond) %>%
group_by(grp) %>%
Upvotes: 0
Views: 178
Reputation: 2688
The solution is going to be something like this. There are some differences between the data you provided and your desired solution, the column types are a little different (like Edit
is provided as a logical). Also, the rows are a little different so this won't be the same as your desired output (you provide a length of 100 in the data, not in your example)
I used a function from this post to assign Group labels. It gives a number you could modify to make it a letter if you want, but you need to handle what would happen if you have over 26 groups.
df <- df %>%
mutate(Message = ifelse(is.na(Message), '', Message)) %>%
mutate(
cond = Edit & Folder %in% c('out', 'draft') & Message == ''
) %>%
mutate(grp = cumsum(!cond)) %>%
filter(cond, Subject != '')
df$Date <- as.POSIXct(as.character(df$Date), format = '%m/%d/%Y %H:%M:%OS')
get_group_number = function(){
i = 0
function(){
i <<- i+1
i
}
}
group_number = get_group_number()
df <- df %>% group_by(Subject, Recipient, Length) %>% mutate(Group = group_number())
df %>% group_by(grp) %>%
summarise(Start = min(Date), End = max(Date),
Duration = End - Start, Group = unique(Group)[1])
# A tibble: 3 x 5
grp Start End Duration Group
<int> <dttm> <dttm> <drtn> <dbl>
1 0 2020-01-02 01:00:10 2020-01-02 01:00:30 20 secs 1
2 2 2020-01-02 01:02:00 2020-01-02 01:02:05 5 secs 1
3 4 2020-01-02 01:03:00 2020-01-02 01:03:20 20 secs 2
I think you might also want to consider how you define groups, since it seems like you could have the same Subject, Recipient and Length by chance alone
Upvotes: 1