Reputation: 57
I am working on a project where I have a few million rows, each containing a timestamp and the unique ID of the event it belongs to. The current condition is that rows with the same event ID must be at least 1 minute apart (there can't be two events of the same ID with a time delta of less than 1 minute).
What I want to simulate is the situation that would arise if the minimal timestamp difference were 3 minutes instead.
TIME_STAMP           PREV_TIME_STAMP      Unique ID
06-27-2021 07:07:22  06-27-2021 06:30:00          1
06-27-2021 07:18:26  06-27-2021 07:07:22          1
06-27-2021 07:20:26  06-27-2021 07:18:26          1
06-27-2021 07:22:26  06-27-2021 07:20:26          1
06-27-2021 07:22:26  06-27-2021 07:22:26          1
06-27-2021 15:18:05  06-27-2021 15:11:00          2
06-27-2021 15:19:05  06-27-2021 15:18:05          2
06-27-2021 12:31:37  06-27-2021 12:30:00          2
06-27-2021 12:35:05  06-27-2021 12:30:00          2
The problem is that I can't just add a new column with the difference between consecutive messages; I really do need a loop (or something equivalent). Why? See below.
From the table, the situation is the following: I need to define a reference timestamp (the previous ACCEPTED time), and the delta between the new time and that previously ACCEPTED time has to be 3 minutes or more.
I hope that I have managed to explain it well enough. If not, please comment and I will provide as much information as I can.
Thanks in advance!
EDIT:
df <- data.frame(
    TIME_STAMP = as.POSIXct(strptime(
        c("06-27-2021 07:07:22",
          "06-27-2021 07:18:26",
          "06-27-2021 07:20:26",
          "06-27-2021 07:22:26",
          "06-27-2021 07:22:26",
          "06-27-2021 15:18:05",
          "06-27-2021 15:19:05",
          "06-27-2021 12:31:37",
          "06-27-2021 12:35:05"), "%m-%d-%Y %H:%M:%S")),
    PREV_TIME_STAMP = as.POSIXct(strptime(
        c("06-27-2021 06:30:00",
          "06-27-2021 07:07:22",
          "06-27-2021 07:18:26",
          "06-27-2021 07:20:26",
          "06-27-2021 07:22:26",
          "06-27-2021 15:11:00",
          "06-27-2021 15:18:05",
          "06-27-2021 12:30:00",
          "06-27-2021 12:30:00"), "%m-%d-%Y %H:%M:%S")),
    ID = c(1, 1, 1, 1, 1, 2, 2, 2, 2))
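To make the rule concrete, here is a naive sketch of the loop I have in mind (keep_rows() is just an illustrative helper name; too slow to run over millions of rows, hence this question):
keep_rows <- function(times, gap = 180) {
    keep <- logical(length(times))
    last_accepted <- -Inf
    for (i in seq_along(times)) {
        if (as.numeric(times[i]) - last_accepted >= gap) {
            keep[i] <- TRUE  # accepted: becomes the new reference time
            last_accepted <- as.numeric(times[i])
        }
    }
    keep
}
df_sorted <- df[order(df$ID, df$TIME_STAMP), ]
df_sorted[unlist(lapply(split(df_sorted$TIME_STAMP, df_sorted$ID), keep_rows)), ]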
Upvotes: 5
Views: 694
Reputation: 46876
This can be done iteratively. The idea is to identify points that must be included, use these to remove points that can't be included, and repeat until done.
Here's some simple data -- not time stamps but integers (timestamps are easily converted to integers via as.integer()) -- supposing that we are interested in a 'width' of 10: starts less than 10 units apart are to be filtered out.
set.seed(123)
start <- sort(sample(100, 10))
width <- 10
We write a function that takes the start and width:
filter1 <- function(start, width) {
Construct intervals for each start:
    end <- start + width - 1L  # closed interval
Figure out how to put the start and end coordinates in order, and remember which positions in the ordering are start events:
    o <- order(c(start, end))
    is_start <- rep(c(TRUE, FALSE), each = length(start))[o]
Encode start events as 1 and end events as -1, and calculate the 'coverage', the number of intervals open at each event:
    event <- rep(c(1, -1), each = length(start))[o]  # 1 == open, -1 == close
    cvg <- cumsum(event)  # number of open intervals
We know for sure that we want to keep start events where the coverage increments to 1, so keep those:
    must <- (event == 1 & cvg == 1)[is_start]
    open <- start[must]  # non-overlapping events
Then find the starts that are NOT inside those kept intervals:
    close <- end[must] + 1L
    might <- findInterval(start, sort(c(open, close))) %% 2 == 0
Finally, return the values that we know to be good together with those we have not yet excluded:
    must | might  # best guess, so far
}
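As an aside, the findInterval() parity trick works because sort(c(open, close)) alternates the starts of the kept intervals with their (exclusive) ends, so a start falling inside a kept interval lands at an odd interval index. A small illustration with made-up values (width 10, as above):
open <- c(5, 29)                # starts of kept intervals
close <- c(15, 39)              # one past their closed ends
breaks <- sort(c(open, close))  # 5 15 29 39
findInterval(c(3, 7, 20, 30), breaks)            # 0 1 2 3
findInterval(c(3, 7, 20, 30), breaks) %% 2 == 0  # TRUE FALSE TRUE FALSE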
The complete function is:
filter1 <- function(start, width) {
    end <- start + width - 1L  # closed interval
    o <- order(c(start, end))
    is_start <- rep(c(TRUE, FALSE), each = length(start))[o]
    event <- rep(c(1, -1), each = length(start))[o]  # 1 == open, -1 == close
    cvg <- cumsum(event)  # number of open intervals
    must <- (event == 1 & cvg == 1)[is_start]
    open <- start[must]  # non-overlapping events
    close <- end[must] + 1L
    might <- findInterval(start, sort(c(open, close))) %% 2 == 0
    must | might
}
We now have a shortened vector of possible candidates; we iterate until the number of candidates no longer changes:
filter_all <- function(start, width) {
    idx <- !logical(length(start))  # start with all candidates TRUE
    repeat {
        idx0 <- filter1(start[idx], width)
        if (sum(idx0) == sum(idx))
            break
        idx[idx] <- idx0
    }
    idx
}
In action:
> set.seed(123)
> (start <- sort(sample(100, 10)))
[1] 5 29 41 42 50 51 79 83 86 91
> keep <- filter_all(start, 10)
> start[keep]
[1] 5 29 41 51 79 91
This might not be correct (but can be made so) in the case where one interval ends at the same location where another starts. Worst-case performance would be linear in the number of starts (when the end of one interval exactly overlaps the start of the next, width units along), but it seems the number of iterations will often be approximately logarithmic.
This can be applied to groups using data.table, or base R functions such as ave(); see the sketch below.
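For example, a minimal sketch of the grouped data.table variant, assuming the df from the question's EDIT, with filter_all() as above and the POSIXct times converted to integer seconds (so a 3-minute gap is a width of 180):
library(data.table)
DT <- as.data.table(df)
setorder(DT, ID, TIME_STAMP)  # sort within each ID
DT[, keep := filter_all(as.integer(TIME_STAMP), 180L), by = ID]
DT[(keep)]  # rows at least 3 minutes after the previous accepted row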
Upvotes: 1
Reputation: 25225
Assuming that it is just a typo in the last 2 entries of PREV_TIME_STAMP for ID=2, here is another method using Reduce, applied to Roland's dataset.
#sort by TIME_STAMP to make sure older entries come up first
DT[order(TIME_STAMP),
   #convert the numeric result back to POSIXct
   as.POSIXct(
       #keep the distinct set of accepted timestamps
       unique(
           #use y (current) if it is 3+ mins after x (last accepted), else keep x
           Reduce(function(x, y) if (as.double(y - x, units="mins") >= 3) y else x,
                  TIME_STAMP,
                  accumulate=TRUE)
       ),
       origin="1970-01-01", tz="GMT"),
   by=`Unique ID`]
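To see what the Reduce() step produces, here is a small illustration with plain numbers (toy timestamps in minutes) instead of POSIXct values:
v <- c(0, 1, 2, 5, 6, 9)
Reduce(function(x, y) if (y - x >= 3) y else x, v, accumulate=TRUE)
#[1] 0 0 0 5 5 9   (each position carries the last accepted value so far)
unique(Reduce(function(x, y) if (y - x >= 3) y else x, v, accumulate=TRUE))
#[1] 0 5 9         (the accepted timestamps)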
Edit: sharing timings. tl;dr: Roland's Rcpp method is way faster (the Reduce approach likely pays for calling an R closure once per row).
library(data.table)
set.seed(0L)
M <- 2e6
nIDs <- M/1e3
DT <- data.table(
    ID=sample(nIDs, M, replace=TRUE),
    TIME_STAMP=as.POSIXct(as.numeric(Sys.time()) + sample(60*(0:4), M, replace=TRUE),
                          origin="1970-01-01", tz="GMT"))
setorder(DT, ID, TIME_STAMP)
DT2 <- copy(DT)
library(Rcpp)
cppFunction(
    'LogicalVector deleteRow(const NumericVector x) {
        const double n = x.size();
        double j = 0;
        LogicalVector res = LogicalVector(n);
        for (double i = 1; i < n; i++) {
            if (x(i) - x(j) < 180) {
                res[i] = true;
            } else {
                j = i;
            }
        }
        return res;
    }')
filter1 <- function(start, width) {
    end <- start + width - 1L  # closed interval
    o <- order(c(start, end))
    is_start <- rep(c(TRUE, FALSE), each = length(start))[o]
    event <- rep(c(1, -1), each = length(start))[o]  # 1 == open, -1 == close
    cvg <- cumsum(event)  # number of open intervals
    must <- (event == 1 & cvg == 1)[is_start]
    open <- start[must]  # non-overlapping events
    close <- end[must] + 1L
    might <- findInterval(start, sort(c(open, close))) %% 2 == 0
    must | might
}
filter_all <- function(start, width) {
    idx <- !logical(length(start))
    repeat {
        idx0 <- filter1(start[idx], width)
        if (sum(idx0) == sum(idx))
            break
        idx[idx] <- idx0
    }
    idx
}
basemtd <- function() {
    DT[, filter_all(TIME_STAMP, 180), by=ID]  # width of 180 s == 3 minutes
}
rcppmtd <- function() {
    DT[, delete := deleteRow(TIME_STAMP), by=ID]
}
dtmtd2 <- function() {
    DT2[,
        as.POSIXct(
            unique(
                Reduce(function(x, y) if (as.double(y - x, units="mins") >= 3) y else x,
                       TIME_STAMP,
                       accumulate=TRUE)
            ),
            origin="1970-01-01", tz="GMT"),
        by=ID]
}
library(microbenchmark)
microbenchmark(basemtd(), rcppmtd(), dtmtd2(), times=3L)
timings:
Unit: milliseconds
      expr         min          lq        mean      median          uq         max neval
 basemtd()   3579.0786   3601.1930   3608.6677   3623.3073   3623.4623   3623.6173     3
 rcppmtd()     37.0085     37.5365     39.0015     38.0645     39.9980     41.9315     3
  dtmtd2() 210238.1842 210901.3902 211303.2471 211564.5962 211835.7786 212106.9610     3
Upvotes: 1
Reputation: 132864
First you should rearrange the data and remove the redundancy of your two time columns:
library(data.table)
DT <- fread(" TIME_STAMP, Unique ID
06-27-2021 06:30:00, 1
06-27-2021 07:07:22, 1
06-27-2021 07:18:26, 1
06-27-2021 07:20:26, 1
06-27-2021 07:22:26, 1
06-27-2021 07:22:26, 1
06-27-2021 15:11:00, 2
06-27-2021 15:18:05, 2
06-27-2021 15:19:05, 2
06-27-2021 12:31:37, 2
06-27-2021 12:35:05, 2")
Then you can do this easily with Rcpp:
library(Rcpp)
cppFunction(
    'LogicalVector deleteRow(const NumericVector x) {
        const double n = x.size();
        double j = 0;                // index of the last accepted (kept) row
        LogicalVector res = LogicalVector(n);
        for (double i = 1; i < n; i++) {
            if (x(i) - x(j) < 180) { // less than 180 s (3 min) after the last accepted row
                res[i] = true;       // flag this row for deletion
            } else {
                j = i;               // accept this row as the new reference
            }
        }
        return res;
    }')
DT[, TIME_STAMP := as.POSIXct(TIME_STAMP, format = "%m-%d-%Y %H:%M:%S", tz = "GMT")]
setkey(DT, `Unique ID`, TIME_STAMP) #ensure sorting
DT[, delete := deleteRow(TIME_STAMP), by = `Unique ID`]
# TIME_STAMP Unique ID delete
# 1: 2021-06-27 06:30:00 1 FALSE
# 2: 2021-06-27 07:07:22 1 FALSE
# 3: 2021-06-27 07:18:26 1 FALSE
# 4: 2021-06-27 07:20:26 1 TRUE
# 5: 2021-06-27 07:22:26 1 FALSE
# 6: 2021-06-27 07:22:26 1 TRUE
# 7: 2021-06-27 12:31:37 2 FALSE
# 8: 2021-06-27 12:35:05 2 FALSE
# 9: 2021-06-27 15:11:00 2 FALSE
#10: 2021-06-27 15:18:05 2 FALSE
#11: 2021-06-27 15:19:05 2 TRUE
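To then obtain the thinned data itself, simply drop the flagged rows:
DT[!(delete)]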
Upvotes: 5