Reputation: 43
I try to deal with multiple loops with data.table for quite some time and got frustrated. With sql it seamed quite intuitive, but with R I experience some problems.
For example, I would want to fread one txt file (as I have hundreds of files, each around 1 GB), make the calculations (sum price and quant, when time>my.time and for some selected isin's, grouped by my.time, isin and price), write the results to some csv file, remove original txt file from R memory; then redo these calculations one-by-one for all txt files and append the output csv file.
Lets start with the example data (quite small, just two identical files for illustration):
time<-format(seq.POSIXt(as.POSIXct(Sys.Date()), as.POSIXct(Sys.Date()+1), by = "1 sec"),"%H:%M:%S")
n<-length(time)
isin<-paste("US",1:n,sep="")
price<-rnorm(n,101,1)
quant<-rnorm(n,5,1)
dt<-data.table(time,isin,price,quant)
write.table(dt,"raw.txt",append = FALSE,sep = ",",col.names = TRUE, row.names = FALSE)
write.table(dt,"raw2.txt",append = FALSE,sep = ",",col.names = TRUE, row.names = FALSE)
my.files <- list.files(pattern = "raw*.txt")
my.time<-format(seq.POSIXt(as.POSIXct(Sys.Date()), as.POSIXct(Sys.Date()+1), by = "5 min"),"%H:%M:%S")
my.isin<-c("US100","US150","US225","US250","US1050")
Then I try these 2 simple loops:
for (i in my.files){
for (j in my.time){
dt<-fread(i)
write.table(dt[which(isin %in% my.isin & time>j),
.(sprice=sum(price),squant=sum(quant),**time.my=j**), by = .(isin,price)],
"output.csv",append = TRUE,sep = ",",col.names = TRUE)
rm(dt)
}}
Second edit: The loop with j finally started working for me (due to the bolded part). Maybe it would be possible to work without for loops and get the same results?
Thanks very much for help!
Upvotes: 0
Views: 333
Reputation: 1781
The problem you have is the output of your which
statement is returning zero rows. First I would convert your time to time
type. I've then created a 5 minute grouping variable.
This would first aggregate your tables.
dt[,`:=`(time= as.ITime(strptime(time, format="%H:%M:%S")))]
dt[,`:=`(time5 = format(strptime("1970-01-01", "%Y-%m-%d", tz="UTC") +
round(as.numeric(time)/300)*300,"%H:%M"))]
dt[, list(sprice = sum(price),squant= sum(quant)),by = c("time5","price","isin")][isin %in% my.isin]
# time5 price isin sprice squant
# 1: 00:00 102.46668 US1 102.46668 3.002960
# 2: 00:00 99.02186 US2 99.02186 5.253252
# 3: 00:00 100.23665 US3 100.23665 6.153950
# 4: 00:00 102.21466 US4 102.21466 3.461051
# 5: 00:00 100.97890 US5 100.97890 5.893336
you can then filter it by your my.isn
or time5 greater than custom times?
Upvotes: 2