Reputation: 319
I am trying to clean some tick data. My data is in long form. When I convert it to wide it shows
Error: Duplicate identifiers for rows
. The Time column has timestamps for several days. The SYM column has stock symbols for many stocks. This is my sample data:
dput(jojo)
structure(list(Time = structure(c(1459481850, 1459481850, 1459482302,
1459482305, 1459482305, 1459482307, 1459482307, 1459482309, 1459482312,
1459482312, 1459482314, 1459482314, 1459482316, 1459482316, 1459482317,
1459482317, 1459482318, 1459482319, 1459482319, 1459482320), class = c("POSIXct",
"POSIXt"), tzone = "Asia/Calcutta"), PRICE = c(1371.25, 1371.25,
1373.95, 1373, 1373, 1373.95, 1373.95, 1373.9, 1374, 1374, 1374.15,
1374.15, 1374, 1374, 1373.85, 1373.85, 1372.55, 1374.05, 1374.05,
1374.15), SIZE = c(39, 58, 5, 4, 7, 20, 5, 10, 21, 179, 10, 100,
98, 78, 14, 11, 30, 10, 11, 39), SYM = c("A", "A", "A", "A",
"A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B", "B",
"B", "B", "B")), .Names = c("Time", "PRICE", "SIZE", "SYM"), row.names = c(NA,
20L), class = "data.frame")
I need to first find the same time stamps then take median of PRICE and SIZE for these timestamps, and replace these same time stamp rows with a single row containing median values of PRICE and SIZE in the dataset. But my code summarises the whole column rather than the same time stamp rows for a stock symbol. This is my attempt:
#Cleaning duplicate time stamps
tt<- jojo %>%group_by(SYM )%>% summarise(Time = ifelse(n() >= 2, median, mean))
#Making wide form
tt<-spread(tt, SYM, PRICE)
I am getting this error:
Error in eval(substitute(expr), envir, enclos) : Not a vector
Please suggest corrections. It would be nice if I can do the cleaning without using highfrequency package.
Upvotes: 1
Views: 263
Reputation: 176718
You need to choose whether you want to use the dplyr
or xts
paradigm. They do not play well together, mostly because dplyr
expects data.frames and xts
objects are matrices. dplyr
also masks the stats::lag
generic, which prevents method dispatch (e.g. so running lag(.xts(1,1))
at the top-level will not do what you expect).
To solve this problem using the xts
paradigm:
# create a function to convert to xts and take medians of the two columns
unDuplicate <- function(x) {
# create xts object
X <- xts(x[,c("PRICE","SIZE")], x[,"Time"])
# set column names so they will be unique in wide format
colnames(X) <- paste(colnames(X), x[1,"SYM"], sep = ".")
# function to take median of each column
colMedian <- function(obj, ...) {
apply(obj, 2, median, ...)
}
# aggregate by seconds
period.apply(X, endpoints(X, "seconds"), colMedian)
}
# now you can call the function on each symbol, then merge the results
do.call(merge, lapply(split(jojo, jojo$SYM), unDuplicate))
Upvotes: 1