How to take median of column values for NON DISTINCT time stamps only

Question

I am trying to clean some tick data. My data is in long form. When I convert it to wide it shows Error: Duplicate identifiers for rows . The Time column has timestamps for several days. The SYM column has stock symbols for many stocks. This is my sample data:

dput(jojo)
structure(list(Time = structure(c(1459481850, 1459481850, 1459482302, 
1459482305, 1459482305, 1459482307, 1459482307, 1459482309, 1459482312, 
1459482312, 1459482314, 1459482314, 1459482316, 1459482316, 1459482317, 
1459482317, 1459482318, 1459482319, 1459482319, 1459482320), class = c("POSIXct", 
"POSIXt"), tzone = "Asia/Calcutta"), PRICE = c(1371.25, 1371.25, 
1373.95, 1373, 1373, 1373.95, 1373.95, 1373.9, 1374, 1374, 1374.15, 
1374.15, 1374, 1374, 1373.85, 1373.85, 1372.55, 1374.05, 1374.05, 
1374.15), SIZE = c(39, 58, 5, 4, 7, 20, 5, 10, 21, 179, 10, 100, 
98, 78, 14, 11, 30, 10, 11, 39), SYM = c("A", "A", "A", "A", 
"A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B", "B", 
"B", "B", "B")), .Names = c("Time", "PRICE", "SIZE", "SYM"), row.names = c(NA, 
20L), class = "data.frame")

I need to first find the same time stamps then take median of PRICE and SIZE for these timestamps, and replace these same time stamp rows with a single row containing median values of PRICE and SIZE in the dataset. But my code summarises the whole column rather than the same time stamp rows for a stock symbol. This is my attempt:

#Cleaning duplicate time stamps
tt<- jojo %>%group_by(SYM )%>% summarise(Time = ifelse(n() >= 2, median, mean))
#Making wide form
tt<-spread(tt, SYM, PRICE)

I am getting this error:

Error in eval(substitute(expr), envir, enclos) : Not a vector

Please suggest corrections. It would be nice if I can do the cleaning without using highfrequency package.

Joshua Ulrich · Accepted Answer

You need to choose whether you want to use the dplyr or xts paradigm. They do not play well together, mostly because dplyr expects data.frames and xts objects are matrices. dplyr also masks the stats::lag generic, which prevents method dispatch (e.g. so running lag(.xts(1,1)) at the top-level will not do what you expect).

To solve this problem using the xts paradigm:

# create a function to convert to xts and take medians of the two columns
unDuplicate <- function(x) {
  # create xts object
  X <- xts(x[,c("PRICE","SIZE")], x[,"Time"])
  # set column names so they will be unique in wide format
  colnames(X) <- paste(colnames(X), x[1,"SYM"], sep = ".")
  # function to take median of each column
  colMedian <- function(obj, ...) {
    apply(obj, 2, median, ...)
  }
  # aggregate by seconds
  period.apply(X, endpoints(X, "seconds"), colMedian)
}
# now you can call the function on each symbol, then merge the results
do.call(merge, lapply(split(jojo, jojo$SYM), unDuplicate))

How to take median of column values for NON DISTINCT time stamps only

Answers (1)

Related Questions