Reputation: 1011
I'm working in R with xts time series.
What I have: A time series data set with unequally spaced time steps.
What I'd like to get: A time series with equally spaced time steps whose values correspond to the proportion of the original values overlapping the time step (see example below).
Example: With an original series like this:
sample_xts <- as.xts(read.zoo(text='
2016-07-01 00:00:20, 0.0
2016-07-01 00:01:20, 60.0
2016-07-01 00:01:50, 30.0
2016-07-01 00:02:30, 40.0
2016-07-01 00:04:20, 110.0
2016-07-01 00:05:30, 140.0
2016-07-01 00:06:00, 97.0
2016-07-01 00:07:12, 144.0
2016-07-01 00:08:09, 0.0
', sep=',', index=1, tz='', format="%Y-%m-%d %H:%M:%S"))
names(sample_xts) <- c('x')
I'd like to get an equally spaced time series that looks like this:
x
2016-07-01 00:00:00, 0.0
2016-07-01 00:01:00, 40.0
2016-07-01 00:02:00, 60.0
2016-07-01 00:03:00, 60.0
2016-07-01 00:04:00, 60.0
2016-07-01 00:05:00, 100.0
2016-07-01 00:06:00, 157.0
2016-07-01 00:07:00, 120.0
2016-07-01 00:08:00, 24.0
2016-07-01 00:09:00, 0.0
Note:
Here is the sketch I used to create the above example (may help to illustrate what I'd like to do):
I'd like an approach that isn't limited to creating a 1 minute time step series but generally to any fixed time step.
I have looked at many q/a on stackoverflow and tried out many different things but without success.
Any help would be greatly appreciated! Thanks.
Upvotes: 4
Views: 1605
Reputation: 1011
I finally decided to go the "while-loop-way" with this and have created the solution below. It works well - is not super fast but execution time seems to be proportional to the length of the time series. I tested it both with the little example I posted in the question and with a source time series having 330 000 observations and a destination series of about 110 000 time steps.
Both source and destination series can have irregular time steps. The sum of the resulting series is the same as that of the source.
Performance: It is ok fast but I'm sure it could be faster. I guess it's an obvious candidate for a RCpp version which should be significantly faster for long series. For now this will do for me and if/when I get round to creating an RCpp version I'll post here.
Please post if you have suggestions for performance improvement!
Thanks!
sameEndTime <- function(i,j,src_index,dest_index){
if(src_index[i] == dest_index[j]){
TRUE
} else {
FALSE
}
}
wholeSourceStepIsWithinDestStep <- function(i,j,src_index,dest_index){
if(dest_index[j-1] <= src_index[i-1] & src_index[i] <= dest_index[j]){
TRUE
} else {
FALSE
}
}
wholeDestStepIsWithinSourceStep <- function(i,j,src_index,dest_index){
if(src_index[i-1] <= dest_index[j-1] & dest_index[j] <= src_index[i]){
TRUE
} else {
FALSE
}
}
onlyEndOfSourceStepIsWithinDestStep <- function(i,j,src_index,dest_index){
if(src_index[i-1] < dest_index[j-1] & src_index[i] < dest_index[j] & src_index[i] > dest_index[j-1]){
TRUE
} else {
FALSE
}
}
onlyStartOfSourceStepIsWithinDestStep <- function(i,j,src_index,dest_index){
if(src_index[i-1] < dest_index[j] & src_index[i-1] > dest_index[j-1] & src_index[i] > dest_index[j]){
TRUE
} else {
FALSE
}
}
resampleToDestTimeSteps <- function(src, dest){
# src and dest are both xts with only one time series each
# src is the original series and
# dest holds the time steps of the final series
#
# NB: there is an issue with the very first time step
# (gets ignored in this version)
#
original_names <- names(src)
names(src) <- c("value")
names(dest) <- c("value")
dest$value <- dest$value*0.0
dest$value[is.na(dest$value)] <- 0.0
dest[1]$value = 0.0
for(k in 2:length(src)){
src[k]$value <- src[k]$value/as.numeric(difftime(index(src[k]),index(src[k-1]),units="secs"))
}
# First value is NA due to lag at this point (we don't want that)
src$value[1] = 0.0
i = 2 # source timestep counter
j = 2 # destination timestep counter
src_index = index(src)
dest_index = index(dest)
src_length = length(src)
dest_length = length(dest)
# Make sure we start with an overlap
if(src_index[2] < dest_index[1]){
while(src_index[i] < dest_index[1]){
i = i + 1
}
} else if(dest_index[2] < src_index[1]){
while(dest_index[j] < src_index[1]){
j = j + 1
}
}
while(i <= src_length & j <= dest_length){
if( wholeSourceStepIsWithinDestStep(i,j,src_index,dest_index) ){
dest[j]$value = dest[j]$value + as.numeric(src[i]$value)*as.numeric(difftime(src_index[i],src_index[i-1],units="secs"))
if(sameEndTime(i,j,src_index,dest_index)){
j = j+1
}
i = i+1
} else if( wholeDestStepIsWithinSourceStep(i,j,src_index,dest_index) ){
dest[j]$value = dest[j]$value + as.numeric(src[i]$value)*as.numeric(difftime(dest_index[j],dest_index[j-1],units="secs"))
if(sameEndTime(i,j,src_index,dest_index)){
i = i+1
}
j = j+1
} else if( onlyEndOfSourceStepIsWithinDestStep(i,j,src_index,dest_index) ){
dest[j]$value = dest[j]$value + as.numeric(src[i]$value)*as.numeric(difftime(src_index[i],dest_index[j-1],units="secs"))
i = i+1
} else if( onlyStartOfSourceStepIsWithinDestStep(i,j,src_index,dest_index) ){
diff_time = difftime(dest_index[j],src_index[i-1],units="secs")
dest[j]$value = dest[j]$value + as.numeric(src[i]$value)*as.numeric(diff_time)
j = j+1
} else {
print("======================================================")
print(paste0("i=",i,", j=",j))
print(paste0("src_index[i] =",src_index[i]))
print(paste0("dest_index[j] =",dest_index[j]))
print(" ")
print(paste0("src_index[i-1] =",src_index[i-1]))
print(paste0("dest_index[j-1]=",dest_index[j-1]))
print("======================================================")
stop("This should never happen.")
}
}
names(dest) <- original_names
return(dest)
}
Upvotes: 0
Reputation: 173
Here is some code I wrote using zoo
- I haven't used xts
much so I don't know if the same functions can be applied. Hope that helps!
Functions
The following function calculates, for each interval of the original data, the fraction that overlaps with a given interval (Note: In all of the following code, the variable names ta1
and ta2
refer to the start and end of a given time interval (e.g. each of the equal intervals that you need as output), while tb1
and tb2
refer to the start and end of the (unequal) intervals of the original data):
frac.overlap <- function(ta1,ta2,tb1,tb2){
if(tb1 <= ta1 & tb2 >= ta2) { # Interval 2 starts earlier and ends later than interval 1
frac <- as.numeric(difftime(ta2,ta1,units="secs"))/as.numeric(difftime(tb2,tb1,units="secs"))
} else if(tb1 >= ta1 & tb2 <= ta2) { # Interval 2 is fully contained within interval 1
frac <- 1
} else if(tb1 <= ta1 & tb2 >= ta1) { # Interval 2 partly overlaps with interval 1 (starts earlier, ends earlier)
frac <- as.numeric(difftime(tb2,ta1,units="secs"))/as.numeric(difftime(tb2,tb1,units="secs"))
} else if (tb1 <= ta2 & tb2 >= ta2){ # Interval 2 partly overlaps with interval 1 (starts later, ends later)
frac <- as.numeric(difftime(ta2,tb1,units="secs"))/as.numeric(difftime(tb2,tb1,units="secs"))
} else { # No overlap
frac <- 0
}
return(frac)
}
The next function determines which records of the original dataset overlap with the currently considered interval ta1
-ta2
:
check.overlap <- function(ta1,ta2,tb1,tb2){
ov <- vector("logical",4)
ov[1] <- (tb1 <= ta1 & tb2 >= ta2) # Interval 2 starts earlier and ends later than interval 1
ov[2] <- (tb1 >= ta1 & tb2 <= ta2) # Interval 2 is fully contained within interval 1
ov[3] <- (tb1 <= ta1 & tb2 >= ta1) # Interval 2 partly overlaps with interval 1 (starts earlier, ends earlier)
ov[4] <- (tb1 <= ta2 & tb2 >= ta2) # Interval 2 partly overlaps with interval 1 (starts later, ends later)
return(as.logical(sum(ov))) # Gives TRUE if at least one element of ov is TRUE, otherwise FALSE
}
(Note: this works well with the sample data that you provided, but on a larger dataset, I found it to be prohibitively slow. Since I wrote this code to resample time series with a regular time step, I usually use a fixed interval to complete this step, which is dramatically faster. It's probably easy enough to modify the code (see code of the next function) to speed up this step based on the intervals of the original data.)
The next function uses the previous two to calculate the resampled value for an interval ta1
-ta2
:
fracres <- function(tstart,interval,input){
# tstart: POSIX object
# interval: length of interval in seconds
# input: zoo object
ta1 <- tstart
ta2 <- tstart + interval
# First, determine which records of the original data (input) overlap with the current
# interval, to avoid going through the whole object at every iteration
ind <- index(input)
ind1 <- index(lag(input,-1))
recs <- which(sapply(1:length(ind),function(x) check.overlap(ta1,ta2,ind[x],ind1[x])))
#recs <- which(abs(as.numeric(difftime(ind,ta1,units="secs"))) < 601)
# For each record overlapping with the current interval, return the fraction of the input data interval contained in the current interval
if(length(recs) > 0){
fracs <- sapply(1:length(recs), function(x) frac.overlap(ta1,ta2,ind[recs[x]],ind1[recs[x]]))
return(sum(coredata(input)[recs]*fracs))
} else {
return(0)
}
}
(The commented-out line shows how to get the relevant records if the maximum time difference between the original and new time steps is known.)
Application
First, let's read in your sample data as a zoo
object:
sample_zoo <- read.zoo(text='
2016-07-01 00:00:20, 0.0
2016-07-01 00:01:20, 60.0
2016-07-01 00:01:50, 30.0
2016-07-01 00:02:30, 40.0
2016-07-01 00:04:20, 110.0
2016-07-01 00:05:30, 140.0
2016-07-01 00:06:00, 97.0
2016-07-01 00:07:12, 144.0
2016-07-01 00:08:09, 0.0
', sep=',', index=1, tz='', format="%Y-%m-%d %H:%M:%S")
It looks like your dataset contains instantaneous values ("at 01:20
, the value of x
is 60"). Since I wrote this code for summed values, the meaning of the time stamp is different ("the record starting at 01:20
has a value of 60"). To correct for this, the records need to be shifted:
sample_zoo <- lag(sample_zoo,1)
Then, we define a sequence of POSIXct
objects corresponding to the desired resolution:
time.out <- seq.POSIXt(from=as.POSIXct("2016-07-01"),to=(as.POSIXct("2016-07-01")+(60*9)),by="1 min")
We can then apply the function fracres
, described above:
data.out <- sapply(1:length(time.out), function(x) fracres(tstart=time.out[x],interval=60,input=sample_zoo))
The index and data are combined to a zoo
object:
zoo.out <- read.zoo(data.frame(time.out,data.out))
And finally, the time series is shifted again by one step, in the opposite direction as before:
zoo.out <- lag(zoo.out,-1)
2016-07-01 00:01:00 2016-07-01 00:02:00 2016-07-01 00:03:00 2016-07-01 00:04:00 2016-07-01 00:05:00 2016-07-01 00:06:00 2016-07-01 00:07:00 2016-07-01 00:08:00 2016-07-01 00:09:00
40 60 60 60 100 157 120 24 0
Upvotes: 1