Reputation: 2824
We are monitoring 3 processes A, B and C that will always either be in level X, Y or Z. A protocol records when a process changes levels.
df = read.csv(tc <- textConnection('Time1,Process1,Level1
2013-01-09 18:00:34,A,X
2013-01-09 18:00:34,B,Y
2013-01-09 18:00:34,C,X
2013-01-09 22:00:59,A,Z
2013-01-10 00:10:38,A,X
2013-01-10 18:38:35,B,Z
2013-01-11 05:03:11,A,Z
2013-01-11 11:09:10,C,Y
2013-01-11 12:01:18,A,Off
2013-01-11 12:01:18,B,Off
2013-01-11 12:01:18,C,Off
'),header=TRUE)
close.connection(tc)
df$Time1 = as.POSIXct(df$Time1)
Monitoring was started at 2013-01-09 18:00:34 and switched off at 2013-01-11 12:01:18. Between 2013-01-09 18:00:34 and 2013-01-09 22:00:59 process A was in level X, between 2013-01-09 22:00:59 and 2013-01-10 00:10:38 process A was in level Z.
For charting purposes, we would like to insert the last and first level state for each process for each midnight into the protocol:
2013-01-09 23:59:59,A,Z
2013-01-10 00:00:00,A,Z
2013-01-10 23:59:59,A,X
2013-01-11 00:00:00,A,X
2013-01-09 23:59:59,B,Y
2013-01-10 00:00:00,B,Y
2013-01-10 23:59:59,B,Z
2013-01-11 00:00:00,B,Z
2013-01-09 23:59:59,C,X
2013-01-10 00:00:00,C,X
2013-01-10 23:59:59,C,X
2013-01-11 00:00:00,C,X
It's ok to assume that there is not event in the log between 23:59:59 and 00:00:00. Finally, the protocol will be sorted by Time1 after insertion (that we can figure out ourselves). Any guidance is much appreciated!
Upvotes: 2
Views: 87
Reputation: 118849
(+1) Quite intricate and interesting task. I think I have the answer. I'll try to explain the method here. I hope it makes sense. There are two tricky bits here. My solution uses data.table
.
First:
I found it easier to construct first, the first two columns of the output that you require. This is done in the first part of the code shown below:
require(data.table)
dates <- unique(as.character(strptime(as.character(df$Time1), "%Y-%m-%d")))
dates <- dates[1:(length(dates)-1)]
dates <- strptime(paste(dates, "23:59:59"), "%Y-%m-%d %H:%M:%S")
dates <- sort(c(dates, dates+1))
Time <- rep(dates, length(levels(df$Process1)))
Process <- rep(levels(df$Process1), each=length(dates))
dt.out <- data.table(Time=as.POSIXct(Time), Process=Process)
# data.table outputs crazy values if not converted using as.POSIXct..?!
This should be straightforward to understand by looking at what each line of code does. I hope its extendable for other scenarios.
Second:
The second bit is equally tricky, but it could be accomplished in one line using data.table
. It took a while to figure out, but its awesome!
dt <- data.table(df, key="Process1") # convert input data.frame to data.table
out <- dt.out[, dt[J(Process)]$Level1[max(which(dt[J(Process)]$Time1 < Time))],
by = c("Process", "Time")]
> out
Process Time V1
1: A 2013-01-09 23:59:59 Z
2: A 2013-01-10 00:00:00 Z
3: A 2013-01-10 23:59:59 X
4: A 2013-01-11 00:00:00 X
5: B 2013-01-09 23:59:59 Y
6: B 2013-01-10 00:00:00 Y
7: B 2013-01-10 23:59:59 Z
8: B 2013-01-11 00:00:00 Z
9: C 2013-01-09 23:59:59 X
10: C 2013-01-10 00:00:00 X
11: C 2013-01-10 23:59:59 X
12: C 2013-01-11 00:00:00 X
Let me break these two lines into parts to explain what's happening.
In the first line, we set key
for dt
as Process1
. This allows VERY fast
filtering of the data by column Process1
. That is, dt["A"]
is equivalent to df[df$Process1 == "A"]
, but the former is blazing fast.
In the second line, quite a lot of things are happening. We already created dt.out
with the first two columns of the output required. All that remains is the third column. Look at the last part of the line that says by = c("Process", "Time")
. Here, we are splitting data.table dt.out
by these two variables. And to each one of the split data.table
, we apply dt[J(Process)]$Level1[max(which(dt[J(Process)]$Time1 < Time))]
which basically picks out the maximum index
from all the current Time1
values that are < Time
from the data.table
filtered by Process
and uses this maximum index to return the corresponding Levels1
value.
Hope this helps.
Upvotes: 3