Reputation: 5
How can I fix the code below so it runs fast enough to process a 3.2 million row query without choking in the calculation section, as it does now? Fast here means hours at most, not days, and I suspect minutes is realistic on the PC it's running on.
I have a local SQL database and I'm trying to pull down data, calculate a new value, and write it back to a different table. However, the calculation code is currently extremely slow, and since I am processing approximately 3.2 million records this is a problem. All of this is running in R inside Visual Studio (R Tools for Visual Studio).
I've never used R before, so I'm not sure what is what; in fact, I had a lot of trouble finding good documentation to get to this point. From the lack of speed, and the way it seems to slow down as it runs, I suspect that inside the loop I am either reading data by index from a list or appending to a list that doesn't keep a pointer to its end.
library(RODBC)

# Pull the whole source table down from the local database
db <- odbcConnect("LocalDB")
results <- sqlQuery(db, 'SELECT * FROM stockAppData;')
print("Completed loading data!")

# Convert the columns sqlQuery returned as factors into usable types
results$pDate <- as.Date(results$pDate)
results$Symbol <- as.character(results$Symbol)
results$Exchange <- as.character(results$Exchange)
print("Completed formatting data")
#Table to temporarily hold results that will be uploaded to the SQL DB.
#(Note: the name "table" also shadows base R's table() function.)
table <- data.frame(Symbol = character(),
                    Exchange = character(),
                    pDate = as.Date(character()),
                    pCloseChange = double(),
                    stringsAsFactors = FALSE)
print("MADE TABLE")
#This is the loop that seems to be super slow
for (i in 2:nrow(results)) {
    s <- results$Symbol[i]
    e <- results$Exchange[i]
    d <- results$pDate[i]
    # Only calculate when this row continues the same symbol/exchange series
    if (results$Symbol[i] == results$Symbol[i - 1] &&
        results$Exchange[i] == results$Exchange[i - 1]) {
        pcc <- (results$pClose[i] / results$pClose[i - 1]) - 1
        # rbind here copies the whole table on every iteration
        table <- rbind(table, c(s, e, d, pcc))
    } else {
        cat("Calculated Pcgain for: ", results$Symbol[i - 1], "\n", sep = "")
    }
}
#Never been here because the loop takes forever
cat("Finished Calculations: returning results to DB")
columnTypes <- list(Symbol = "VARCHAR(10)", Exchange = "VARCHAR(5)", pDate = "DATE", pCloseChange = "DOUBLE PRECISION")
sqlSave(db, table, varTypes = columnTypes, rownames = FALSE, colnames = FALSE)
odbcClose(db)
Other things I suspect are slow, though not as slow, include the character comparison: I had a lot of trouble comparing factors and I don't know how to do it properly. I also don't strictly need double precision, but I prefer it since it will be a requirement in additional calculations.
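To illustrate the factor trouble, here is a minimal sketch with made-up values; converting to character first, as the code above does, sidesteps it:

# Factors with different level sets cannot be compared directly
a <- factor("APP", levels = c("APP", "APPL"))
b <- factor("APP", levels = c("APP", "IBM"))
a == b                               # NA, with a warning about level sets
as.character(a) == as.character(b)   # TRUE -- plain string comparison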
Each time the loop runs we evaluate the variable pcc, which stands for "Percent Change of Close" and is equal to the current day's close divided by the previous day's close, minus 1. The loop simply adds this value, along with the symbol, exchange and date from the record for the day being evaluated, to the results table. As explained in an answer below, I used rbind in a loop, which is definitely one reason it's slow.
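For example, if a security closes at 1.00 one day and at 2.00 the next:

pcc <- (2.00 / 1.00) - 1   # 1.00, i.e. a 100% gain on the day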
The if statement makes sure we skip the first record for a given symbol/exchange combination. We have to skip the first day of any security because pcc relies on the previous day's close, and on that row the previous record belongs to a different security. See the example below (note: the input and output are simplified).
Input:

Symbol Exchange pDate      pClose
APP    TSX      2018-01-13 1.00
APP    TSX      2018-01-14 2.00
APP    NYSE     2018-01-13 2.00
APP    NYSE     2018-01-14 3.00
APPL   TSX      2018-01-13 2.00
APPL   TSX      2018-01-14 3.00
Expected output:

Symbol Exchange pDate      pcc
APP    TSX      2018-01-14 1.00
APP    NYSE     2018-01-14 0.50
APPL   TSX      2018-01-14 0.50
Upvotes: 0
Views: 327
Reputation: 25223
You can get a bigger performance boost by using data.table, as follows:
library(data.table)
setDT(results)   # converts the data.frame to a data.table by reference
output <- results[,
    # within each Symbol/Exchange group, compute the close-over-close
    # change and drop the first row, which has no previous close
    .(pDate = pDate[-1L], pcc = (pClose / shift(pClose) - 1)[-1L]),
    by = .(Symbol, Exchange)]
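For example, run against a toy version of the question's input (a sketch; column names assumed to match the real table):

library(data.table)
results <- data.table(
    Symbol   = c("APP", "APP", "APPL", "APPL"),
    Exchange = c("TSX", "TSX", "TSX", "TSX"),
    pDate    = as.Date(c("2018-01-13", "2018-01-14", "2018-01-13", "2018-01-14")),
    pClose   = c(1.00, 2.00, 2.00, 3.00))
results[, .(pDate = pDate[-1L], pcc = (pClose / shift(pClose) - 1)[-1L]),
        by = .(Symbol, Exchange)]
#    Symbol Exchange      pDate pcc
# 1:    APP      TSX 2018-01-14 1.0
# 2:   APPL      TSX 2018-01-14 0.5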
Upvotes: 0
Reputation: 24139
There are two major factors in this code which make it slow: the for loop and the rbind within the loop. R is a vectorized language, so instead of looping through a vector one item at a time, you can operate on the entire vector at once.
Binding involves making multiple copies of the data in R; the object creation, deletion and clean-up are time consuming.
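To see the difference, compare growing a data frame inside a loop with collecting the pieces in a pre-allocated list and binding once at the end (a sketch with made-up data):

n <- 1000
pieces <- vector("list", n)   # pre-allocate the list, then fill by index
for (i in seq_len(n)) {
    pieces[[i]] <- data.frame(x = i, y = i^2)
}
answer <- do.call(rbind, pieces)   # one rbind instead of n copies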
#make sample data
Symbol <- rep(c("A", "B", "C"), each = 10)
set.seed(1)
pClose <- rnorm(30, 100, 5)
results <- data.frame(Symbol, pClose, stringsAsFactors = FALSE)
library(dplyr) #need dplyr's lag function

#vectorize the math:
#perform all of the calculations at once, including the wrong ones
#that cross a symbol boundary; those are filtered out below
pcc <- (results$pClose / lag(results$pClose)) - 1

#find the rows whose symbol matches the previous row's symbol
matchingSymbol <- results$Symbol == lag(results$Symbol, default = "")

#create the final data frame with the filtered data, binding only once
answertable <- cbind(results[matchingSymbol, ], pcc = pcc[matchingSymbol])
In this solution the calculations are vectorized and the binding is done only once. This code can be further improved, but it should already provide a speed-up over your code of roughly 1000x.
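One such improvement (a sketch, not benchmarked) is to let dplyr do the grouping, which also handles the symbol/exchange combination from your example without the manual lag comparison:

library(dplyr)

answertable <- results %>%
    group_by(Symbol) %>%        # add Exchange here as well for the real table
    mutate(pcc = pClose / lag(pClose) - 1) %>%
    filter(!is.na(pcc)) %>%     # drops each group's first row (no previous close)
    ungroup()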
A good reference is "The R Inferno": http://www.burns-stat.com/pages/Tutor/R_inferno.pdf
Upvotes: 2