aqualight213
aqualight213

Reputation: 3

Merging two dataframes with irregular timestamp columns

I have a equity related data sample with prices that are date/time stamped at irregular, random intervals in the seconds, called ESH5 and ESM5

I would like to generate another full data.frame, with a date/time column that increments the time by seconds, and fill in with the values from ESH5, ESM5. Any values will become the 'latest price' which will carry over to the next time interval unless a value exists for the matching time in ESH5 or ESM5

For example, ESH5

    Date                 Price    Type
    22/10/2015 9:00:00   50.10    Bid
    22/10/2015 9:00:02   50.12    Ask
    22/10/2015 9:00:06   50.10    Trade

ESM5

    Date                 Price    Type
    22/10/2015 9:00:01   50.09    Bid
    22/10/2015 9:00:02   50.11    Ask
    22/10/2015 9:00:04   50.09    Trade

I am looking to generate a full data.frame like

    Date                 ESH5.Bid   ESH5.Ask   ESH5.Trade   ESM5.Bid   ESM5.Ask   ESM5.Trade
    22/10/2015 9:00:00      50.10         NA           NA        NA          NA          NA
    22/10/2015 9:00:01      50.10         NA           NA     50.09          NA          NA
    22/10/2015 9:00:02      50.10       50.12          NA     50.09       50.11          NA
    22/10/2015 9:00:03      50.10       50.12          NA     50.09       50.11          NA
    22/10/2015 9:00:04
    22/10/2015 9:00:05
    22/10/2015 9:00:06

Currently, I am generating the table with a for loop and if-else statements. I pregenerate NA empty data frame with the regular timestamp increments, the latest updated bid, ask, trade, then run a conditional check for the same time to fill the table.

My current code works, however the loop takes quite a while to process (10s of minutes). Are there any internal functions inbuilt in R I can use for this search and replace and carry over-like features?

Apologies if this is a bit hard to follow. Thank you.

Upvotes: 0

Views: 1670

Answers (2)

RHA
RHA

Reputation: 3862

I think what you want to do requires several steps:

First, create the date column with seq, as indicated bu @akrun

Secondly rebuild the structure of your data. This can be done in several ways, but i think the dcast function from the reshape2 package is best:

ESH5c <- dcast(ESH5, Date ~ Type, value.var='Price')
ESM5c <- dcast(ESM5, Date ~ Type, value.var='Price')

And the last step is to merge these new data with your date vector.

Upvotes: 1

akrun
akrun

Reputation: 886948

Convert the 'Date' columns in both the datasets from 'character' class to 'POSIXct'.

ESH5$Date <- as.POSIXct(ESH5$Date, format='%d/%m/%Y %H:%M:%S')
ESM5$Date <- as.POSIXct(ESM5$Date, format='%d/%m/%Y %H:%M:%S')

Get the 'min' and 'max' value by concatenating the 'Date' columns of these datasets.

MinD <- min(c(ESH5$Date, ESM5$Date))
MaxD <- max(c(ESH5$Date, ESM5$Date))

Create a sequence of datetime based on 'MinD', and 'MaxD' values as a new dataset

d1 <- data.frame(Date=seq(MinD, MaxD, by='sec'))

merge all the datasets by placing them in a 'list' and using Reduce

d2 <- Reduce(function(...) merge(..., by='Date', all=TRUE),
           list(d1, ESH5, ESM5))

Reshape the 'd2' dataset from 'wide' to 'long'

dLong <- reshape(d2, idvar='Date', varying=2:5, sep=".", direction='long')
dLong$time <- factor(dLong$time, labels=c('ESH5', 'ESM5'))
row.names(dLong) <- NULL

Change the 'long' format to 'wide' using dcast

library(reshape2) 
res <- dcast(dLong, Date~time+Type, value.var='Price')

Remove the extra NA columns

res1 <- res[!grepl('NA', names(res))]

Use na.locf from zoo to fill the NA values with the previous non-NA value

library(zoo)
res1[-1] <- lapply(res1[-1], na.locf, na.rm=FALSE)
res1
#                 Date ESH5_Ask ESH5_Bid ESH5_Trade ESM5_Ask ESM5_Bid ESM5_Trade
#1 2015-10-22 09:00:00       NA     50.1         NA       NA       NA         NA
#2 2015-10-22 09:00:01       NA     50.1         NA       NA    50.09         NA
#3 2015-10-22 09:00:02    50.12     50.1         NA    50.11    50.09         NA
#4 2015-10-22 09:00:03    50.12     50.1         NA    50.11    50.09         NA
#5 2015-10-22 09:00:04    50.12     50.1         NA    50.11    50.09      50.09
#6 2015-10-22 09:00:05    50.12     50.1         NA    50.11    50.09      50.09
#7 2015-10-22 09:00:06    50.12     50.1       50.1    50.11    50.09      50.09

Or using dplyr/tidyr, we can use spread to change the format of each datasets, full_join, them, along with 'd1', change the 'NA' values in each column with 'non-NA' previous value using na.locf and mutate_each. Change the column names with paste (if needed).

 library(dplyr)
 library(tidyr)
 library(zoo)

 res2 <-  full_join(spread(ESH5, Type, Price), 
                        spread(ESM5, Type, Price), by='Date') %>%
                    full_join(d1, ., by='Date') %>%
                    mutate_each(funs(na.locf(., na.rm=FALSE)), -Date) 

 names(res2)[-1] <- c(paste('ESH5', sort(ESH5$Type),sep="_"), 
                      paste('ESM5', sort(ESM5$Type), sep="_"))

  res2
  #               Date ESH5_Ask ESH5_Bid ESH5_Trade ESM5_Ask ESM5_Bid ESM5_Trade
  #1 2015-10-22 09:00:00       NA     50.1         NA       NA       NA         NA
  #2 2015-10-22 09:00:01       NA     50.1         NA       NA    50.09         NA
  #3 2015-10-22 09:00:02    50.12     50.1         NA    50.11    50.09         NA
  #4 2015-10-22 09:00:03    50.12     50.1         NA    50.11    50.09         NA
  #5 2015-10-22 09:00:04    50.12     50.1         NA    50.11    50.09      50.09
  #6 2015-10-22 09:00:05    50.12     50.1         NA    50.11    50.09      50.09
  #7 2015-10-22 09:00:06    50.12     50.1       50.1    50.11    50.09      50.09

Upvotes: 0

Related Questions