Reputation: 143
I have a dataframe with million os rows. I have two datetime columns that indicate a certain interval that a device has been working. After some corrections I got the result as shown in the following dataframe.
data2 <- data.frame(id = 1:9,
Device = c("A","A","A","A","A","A","A","B","B"),
Chanell = c("111","111","111","111","111","111","111","333","333"),
StartSession = c("2020-12-06 22:03:45","2020-12-06 22:10:07","2020-12-27 22:47:00","2020-12-31 15:06:02","2020-12-31 22:04:12","2020-12-31 23:16:05","2020-12-31 23:46:05","2021-01-02 00:29:05","2021-01-03 22:02:04"),
EndSession = c("2020-12-06 22:10:07","2020-12-06 22:29:49","2020-12-27 23:53:09","2020-12-31 15:18:03","2020-12-31 22:19:14","2020-12-31 23:46:05","2021-01-01 00:06:00","2021-01-02 00:38:05","2021-01-03 23:08:07"))
data2$StartSession <- as.POSIXct(data2$StartSession,format = "%Y-%m-%d %H:%M:%S")
data2$EndSession <- as.POSIXct(data2$EndSession,format = "%Y-%m-%d %H:%M:%S")
data2
id Device Chanell StartSession EndSession
1 A 111 2020-12-06 22:03:45 2020-12-06 22:10:07
2 A 111 2020-12-06 22:10:07 2020-12-06 22:29:49
3 A 111 2020-12-27 22:47:00 2020-12-27 23:53:09
4 A 111 2020-12-31 15:06:02 2020-12-31 15:18:03
5 A 111 2020-12-31 22:04:12 2020-12-31 22:19:14
6 A 111 2020-12-31 23:16:05 2020-12-31 23:46:05
7 A 111 2020-12-31 23:46:05 2021-01-01 00:06:00
8 B 333 2021-01-02 00:29:05 2021-01-02 00:38:05
9 B 333 2021-01-03 22:02:04 2021-01-03 23:08:07
Each row represents a Device session, in a given chanell and during an interval of time
After making the necessary corrections, i found the following problem: When a device has two or more continuous sessions on the same channel, then the information should be aggregated in a single session
For the example above, this is my expected result:
id Device Chanell StartSession EndSession
1 A 111 2020-12-06 22:03:45 2020-12-06 22:29:49
3 A 111 2020-12-27 22:47:00 2020-12-27 23:53:09
4 A 111 2020-12-31 15:06:02 2020-12-31 15:18:03
5 A 111 2020-12-31 22:04:12 2020-12-31 22:19:14
6 A 111 2020-12-31 23:16:05 2021-01-01 00:06:00
7 B 333 2021-01-02 00:29:05 2021-01-02 00:38:05
8 B 333 2021-01-03 22:02:04 2021-01-03 23:08:07
Only 4 sessions from Device A should be affected as they are the only ones that are continuous (EndSession1 = StartSession2) and Chanell is the same for both sessions
I had already posted a question on the topic here
https://stackoverflow.com/a/68407210/15926848
but unfortunately after testing the real problem i realized that i wasn't getting the intended answer. Unfortunately, I couldn't change it to get the expected result.
By the way, should I delete the other post?
tks in advance
Upvotes: 1
Views: 81
Reputation: 388817
Try this data.table
option for faster processing -
library(data.table)
setDT(data2)[, group := cumsum(StartSession - shift(EndSession, fill = 0) > 0), .(Device, Chanell)]
data2[, .(id = first(id),
StartSession = min(StartSession),
EndSession = max(EndSession)), .(Device, Chanell, group)]
# Device Chanell group id StartSession EndSession
#1: A 111 1 1 2020-12-06 22:03:45 2020-12-06 22:29:49
#2: A 111 2 3 2020-12-27 22:47:00 2020-12-27 23:53:09
#3: A 111 3 4 2020-12-31 15:06:02 2020-12-31 15:18:03
#4: A 111 4 5 2020-12-31 22:04:12 2020-12-31 22:19:14
#5: A 111 5 6 2020-12-31 23:16:05 2021-01-01 00:06:00
#6: B 333 1 8 2021-01-02 00:29:05 2021-01-02 00:38:05
#7: B 333 2 9 2021-01-03 22:02:04 2021-01-03 23:08:07
Upvotes: 1
Reputation: 1119
My attempt
dat <- data2 %>%
group_by(Device, Chanell) %>%
mutate(time_diff = (StartSession) - lag(EndSession, 1)) %>%
ungroup()
Find the differences for each channel and device between the end of the previous session and the start of the new one.
ind.same.date <- which(dat$time_diff == 0)
dat$StartSession[ind.same.date] <- dat$StartSession[ind.same.date - 1]
If the difference is zero then overwrite the previous start session.
dat <- dat %>%
slice(-c(ind.same.date - 1)) %>%
select(-time_diff)
Remove unnecessary rows and column.
# A tibble: 7 x 5
id Device Chanell StartSession EndSession
<int> <chr> <chr> <dttm> <dttm>
1 2 A 111 2020-12-06 22:03:45 2020-12-06 22:29:49
2 3 A 111 2020-12-27 22:47:00 2020-12-27 23:53:09
3 4 A 111 2020-12-31 15:06:02 2020-12-31 15:18:03
4 5 A 111 2020-12-31 22:04:12 2020-12-31 22:19:14
5 7 A 111 2020-12-31 23:16:05 2021-01-01 00:06:00
6 8 B 333 2021-01-02 00:29:05 2021-01-02 00:38:05
7 9 B 333 2021-01-03 22:02:04 2021-01-03 23:08:07
Upvotes: 3