Reputation: 3
I have a equity related data sample with prices that are date/time stamped at irregular, random intervals in the seconds, called ESH5
and ESM5
I would like to generate another full data.frame, with a date/time column that increments the time by seconds, and fill in with the values from ESH5
, ESM5
. Any values will become the 'latest price' which will carry over to the next time interval unless a value exists for the matching time in ESH5
or ESM5
For example, ESH5
Date Price Type
22/10/2015 9:00:00 50.10 Bid
22/10/2015 9:00:02 50.12 Ask
22/10/2015 9:00:06 50.10 Trade
ESM5
Date Price Type
22/10/2015 9:00:01 50.09 Bid
22/10/2015 9:00:02 50.11 Ask
22/10/2015 9:00:04 50.09 Trade
I am looking to generate a full data.frame like
Date ESH5.Bid ESH5.Ask ESH5.Trade ESM5.Bid ESM5.Ask ESM5.Trade
22/10/2015 9:00:00 50.10 NA NA NA NA NA
22/10/2015 9:00:01 50.10 NA NA 50.09 NA NA
22/10/2015 9:00:02 50.10 50.12 NA 50.09 50.11 NA
22/10/2015 9:00:03 50.10 50.12 NA 50.09 50.11 NA
22/10/2015 9:00:04
22/10/2015 9:00:05
22/10/2015 9:00:06
Currently, I am generating the table with a for loop and if-else statements. I pregenerate NA
empty data frame with the regular timestamp increments, the latest updated bid
, ask
, trade
, then run a conditional check for the same time to fill the table.
My current code works, however the loop takes quite a while to process (10s of minutes). Are there any internal functions inbuilt in R I can use for this search and replace and carry over-like features?
Apologies if this is a bit hard to follow. Thank you.
Upvotes: 0
Views: 1670
Reputation: 3862
I think what you want to do requires several steps:
First, create the date column with seq
, as indicated bu @akrun
Secondly rebuild the structure of your data. This can be done in several ways, but i think the dcast
function from the reshape2
package is best:
ESH5c <- dcast(ESH5, Date ~ Type, value.var='Price')
ESM5c <- dcast(ESM5, Date ~ Type, value.var='Price')
And the last step is to merge
these new data with your date vector.
Upvotes: 1
Reputation: 886948
Convert the 'Date' columns in both the datasets from 'character' class to 'POSIXct'.
ESH5$Date <- as.POSIXct(ESH5$Date, format='%d/%m/%Y %H:%M:%S')
ESM5$Date <- as.POSIXct(ESM5$Date, format='%d/%m/%Y %H:%M:%S')
Get the 'min' and 'max' value by concatenating the 'Date' columns of these datasets.
MinD <- min(c(ESH5$Date, ESM5$Date))
MaxD <- max(c(ESH5$Date, ESM5$Date))
Create a sequence of datetime based on 'MinD', and 'MaxD' values as a new dataset
d1 <- data.frame(Date=seq(MinD, MaxD, by='sec'))
merge
all the datasets by placing them in a 'list' and using Reduce
d2 <- Reduce(function(...) merge(..., by='Date', all=TRUE),
list(d1, ESH5, ESM5))
Reshape the 'd2' dataset from 'wide' to 'long'
dLong <- reshape(d2, idvar='Date', varying=2:5, sep=".", direction='long')
dLong$time <- factor(dLong$time, labels=c('ESH5', 'ESM5'))
row.names(dLong) <- NULL
Change the 'long' format to 'wide' using dcast
library(reshape2)
res <- dcast(dLong, Date~time+Type, value.var='Price')
Remove the extra NA columns
res1 <- res[!grepl('NA', names(res))]
Use na.locf
from zoo
to fill the NA values with the previous non-NA value
library(zoo)
res1[-1] <- lapply(res1[-1], na.locf, na.rm=FALSE)
res1
# Date ESH5_Ask ESH5_Bid ESH5_Trade ESM5_Ask ESM5_Bid ESM5_Trade
#1 2015-10-22 09:00:00 NA 50.1 NA NA NA NA
#2 2015-10-22 09:00:01 NA 50.1 NA NA 50.09 NA
#3 2015-10-22 09:00:02 50.12 50.1 NA 50.11 50.09 NA
#4 2015-10-22 09:00:03 50.12 50.1 NA 50.11 50.09 NA
#5 2015-10-22 09:00:04 50.12 50.1 NA 50.11 50.09 50.09
#6 2015-10-22 09:00:05 50.12 50.1 NA 50.11 50.09 50.09
#7 2015-10-22 09:00:06 50.12 50.1 50.1 50.11 50.09 50.09
Or using dplyr/tidyr
, we can use spread
to change the format of each datasets, full_join
, them, along with 'd1', change the 'NA' values in each column with 'non-NA' previous value using na.locf
and mutate_each
. Change the column names with paste
(if needed).
library(dplyr)
library(tidyr)
library(zoo)
res2 <- full_join(spread(ESH5, Type, Price),
spread(ESM5, Type, Price), by='Date') %>%
full_join(d1, ., by='Date') %>%
mutate_each(funs(na.locf(., na.rm=FALSE)), -Date)
names(res2)[-1] <- c(paste('ESH5', sort(ESH5$Type),sep="_"),
paste('ESM5', sort(ESM5$Type), sep="_"))
res2
# Date ESH5_Ask ESH5_Bid ESH5_Trade ESM5_Ask ESM5_Bid ESM5_Trade
#1 2015-10-22 09:00:00 NA 50.1 NA NA NA NA
#2 2015-10-22 09:00:01 NA 50.1 NA NA 50.09 NA
#3 2015-10-22 09:00:02 50.12 50.1 NA 50.11 50.09 NA
#4 2015-10-22 09:00:03 50.12 50.1 NA 50.11 50.09 NA
#5 2015-10-22 09:00:04 50.12 50.1 NA 50.11 50.09 50.09
#6 2015-10-22 09:00:05 50.12 50.1 NA 50.11 50.09 50.09
#7 2015-10-22 09:00:06 50.12 50.1 50.1 50.11 50.09 50.09
Upvotes: 0