RitaM
RitaM

Reputation: 143

dplyr: Aggregate continuous datetime with conditions

I have a dataframe with million os rows. I have two datetime columns that indicate a certain interval that a device has been working. After some corrections I got the result as shown in the following dataframe.

data2 <- data.frame(id = 1:9,
               Device = c("A","A","A","A","A","A","A","B","B"),
               Chanell = c("111","111","111","111","111","111","111","333","333"),
               StartSession = c("2020-12-06 22:03:45","2020-12-06 22:10:07","2020-12-27 22:47:00","2020-12-31 15:06:02","2020-12-31 22:04:12","2020-12-31 23:16:05","2020-12-31 23:46:05","2021-01-02 00:29:05","2021-01-03 22:02:04"),
               EndSession = c("2020-12-06 22:10:07","2020-12-06 22:29:49","2020-12-27 23:53:09","2020-12-31 15:18:03","2020-12-31 22:19:14","2020-12-31 23:46:05","2021-01-01 00:06:00","2021-01-02 00:38:05","2021-01-03 23:08:07"))

data2$StartSession <- as.POSIXct(data2$StartSession,format = "%Y-%m-%d %H:%M:%S")
data2$EndSession <- as.POSIXct(data2$EndSession,format = "%Y-%m-%d %H:%M:%S")

data2

id Device  Chanell       StartSession           EndSession
1   A      111     2020-12-06 22:03:45    2020-12-06 22:10:07
2   A      111     2020-12-06 22:10:07    2020-12-06 22:29:49
3   A      111     2020-12-27 22:47:00    2020-12-27 23:53:09
4   A      111     2020-12-31 15:06:02    2020-12-31 15:18:03
5   A      111     2020-12-31 22:04:12    2020-12-31 22:19:14
6   A      111     2020-12-31 23:16:05    2020-12-31 23:46:05
7   A      111     2020-12-31 23:46:05    2021-01-01 00:06:00
8   B      333     2021-01-02 00:29:05    2021-01-02 00:38:05
9   B      333     2021-01-03 22:02:04    2021-01-03 23:08:07

Each row represents a Device session, in a given chanell and during an interval of time

After making the necessary corrections, i found the following problem: When a device has two or more continuous sessions on the same channel, then the information should be aggregated in a single session

For the example above, this is my expected result:

id Device  Chanell       StartSession           EndSession
1   A       111     2020-12-06 22:03:45     2020-12-06 22:29:49
3   A       111     2020-12-27 22:47:00     2020-12-27 23:53:09
4   A       111     2020-12-31 15:06:02     2020-12-31 15:18:03
5   A       111     2020-12-31 22:04:12     2020-12-31 22:19:14
6   A       111     2020-12-31 23:16:05     2021-01-01 00:06:00
7   B       333     2021-01-02 00:29:05     2021-01-02 00:38:05
8   B       333     2021-01-03 22:02:04     2021-01-03 23:08:07

Only 4 sessions from Device A should be affected as they are the only ones that are continuous (EndSession1 = StartSession2) and Chanell is the same for both sessions

I had already posted a question on the topic here

https://stackoverflow.com/a/68407210/15926848

but unfortunately after testing the real problem i realized that i wasn't getting the intended answer. Unfortunately, I couldn't change it to get the expected result.

By the way, should I delete the other post?

tks in advance

Upvotes: 1

Views: 81

Answers (2)

Ronak Shah
Ronak Shah

Reputation: 388817

Try this data.table option for faster processing -

library(data.table)

setDT(data2)[, group := cumsum(StartSession - shift(EndSession, fill = 0) > 0), .(Device, Chanell)]

data2[, .(id = first(id),
          StartSession = min(StartSession), 
          EndSession = max(EndSession)), .(Device, Chanell, group)]

#   Device Chanell group id        StartSession          EndSession
#1:      A     111     1  1 2020-12-06 22:03:45 2020-12-06 22:29:49
#2:      A     111     2  3 2020-12-27 22:47:00 2020-12-27 23:53:09
#3:      A     111     3  4 2020-12-31 15:06:02 2020-12-31 15:18:03
#4:      A     111     4  5 2020-12-31 22:04:12 2020-12-31 22:19:14
#5:      A     111     5  6 2020-12-31 23:16:05 2021-01-01 00:06:00
#6:      B     333     1  8 2021-01-02 00:29:05 2021-01-02 00:38:05
#7:      B     333     2  9 2021-01-03 22:02:04 2021-01-03 23:08:07

Upvotes: 1

Kozolovska
Kozolovska

Reputation: 1119

My attempt

dat <- data2 %>% 
  group_by(Device, Chanell) %>% 
  mutate(time_diff = (StartSession) - lag(EndSession, 1)) %>% 
  ungroup()

Find the differences for each channel and device between the end of the previous session and the start of the new one.

ind.same.date <- which(dat$time_diff == 0)
dat$StartSession[ind.same.date] <- dat$StartSession[ind.same.date - 1]

If the difference is zero then overwrite the previous start session.

dat <- dat %>% 
  slice(-c(ind.same.date - 1)) %>% 
  select(-time_diff)

Remove unnecessary rows and column.

# A tibble: 7 x 5
     id Device Chanell StartSession        EndSession         
  <int> <chr>  <chr>   <dttm>              <dttm>             
1     2 A      111     2020-12-06 22:03:45 2020-12-06 22:29:49
2     3 A      111     2020-12-27 22:47:00 2020-12-27 23:53:09
3     4 A      111     2020-12-31 15:06:02 2020-12-31 15:18:03
4     5 A      111     2020-12-31 22:04:12 2020-12-31 22:19:14
5     7 A      111     2020-12-31 23:16:05 2021-01-01 00:06:00
6     8 B      333     2021-01-02 00:29:05 2021-01-02 00:38:05
7     9 B      333     2021-01-03 22:02:04 2021-01-03 23:08:07

Upvotes: 3

Related Questions