Reputation: 131
I would like to compare the times in the date-time column (which is the index column of an xts
object). Let’s say I have data set called My_Data
given as follows:
My_Data <- structure(c(5, 2, 4, 8, 9), index = structure(c(1184599268.133,
1184602925.231, 1184604481.931, 1184685301.769, 1184685668.133), tzone = "",
tclass = c("POSIXct", "POSIXt")), class = c("xts", "zoo"),
.indexCLASS = c("POSIXct", "POSIXt"), tclass = c("POSIXct", "POSIXt"),
.indexTZ = "", tzone = "", .Dim = c(5L, 1L), .Dimnames = list(NULL, "Price"))
Note that since the data is an xts
object, the date-time column is in the index of the dataset and can be referred to by using the function index(My_Data)
For my program I am iterating over my dataset, and I would like to know when the date changes in the index of the data (which, in the data above, would be going from the dates 2007-07-16
to 2007-07-17
)
To this end, I’ve tried using the following function to extract the date at each index row:
(format(index(My_Data)[1], format = "%Y-%m-%d"))
Where this function will extract the date from the index of the first row of my data (which is 2007-07-16
). So to extract an arbitrary row i
I use
(format( index(My_Data)[i], format = "%Y-%m-%d"))
And, to compare two different rows to find out if the dates are different I use:
(format( index(My_Data)[1], format = "%Y-%m-%d")) <
(format( index(My_Data)[2], format = "%Y-%m-%d"))
Which compares the date of the first index row with the date from the second index row.
However, this comparison is extremely slow – I checked this comparison using the microbenchmark
package and it is of the order of milliseconds! Since I have a large amount of data I was wondering if there was a more efficient way to check when the dates have changed in the index, since this method will slow my code down a lot.
Upvotes: 1
Views: 879
Reputation: 176648
You have (at least) a couple better options. Choose the one that works best for your situation.
You can use .indexDate()
and diff()
to tell you when the date of the index changes. Remember that diff()
returns an object with a length that has one observation less than its input, so you need to concatenate it with a leading zero or NA
.
merge(My_Data, newdate = c(0, diff(.indexDate(My_Data))))
# Price newdate
# 2007-07-16 10:21:08 5 0
# 2007-07-16 11:22:05 2 0
# 2007-07-16 11:48:01 4 0
# 2007-07-17 10:15:01 8 1
# 2007-07-17 10:21:08 9 0
You can also use endpoints()
to tell you the last observation for every day in your series. Remember that endpoints()
always returns a vector that starts with 0
and ends with nrow(x)
.
endpoints(My_Data, "days")
# [1] 0 3 5
The reason your solution takes longer is because converting to character strings is expensive. I would also guess that using logical operators on character strings is more expensive than on numbers, because character string sorting depends on your locale's collation order (so more operations are likely).
Upvotes: 1