Reputation: 18238
Recently I have been running some data analysis programs that are (or at least seem to be) very data intensive. I have a fairly large dataset of about 6 million rows and 20 columns. The program is written in Python, using mostly pandas and numpy, and it performs many operations on the data.
I have already made improvements along the way after running the profiler. I created an index on the database table for the DataDate column, which increased the speed a lot, but the bottleneck in the code is still, by far, the pandas read_sql function.
My program often wants to access only small pieces of the whole dataset, say 15 rows at a time. It seems to me there is a tradeoff between the size of each retrieval and the number of times the program retrieves data. In the profiler screenshot below, you can see that the read_sql function took 761 while the whole program took about 790 to complete.
The program has only one subroutine, and only one line calls read_sql:
this_trade = pandas.read_sql("""select * from options where "OptionRoot" = '%s' and "DataDate" between '%s' and '%s' order by "DataDate" asc limit 21;""" % (row['OptionRoot'],row['DataDate'],row['Expiration']), engine)
I tried loading the whole options table once at the beginning into a pandas dataframe, and then accessing the data from the dataframe in the subroutine:
this_trade = options[ ( options['OptionRoot'] == row['OptionRoot'] ) & (options['DataDate'].between(row['DataDate'],row['Expiration']) ) ]
This was going so slowly that I didn't even want to wait for the profiler output.
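If the in-memory approach is revisited, a minimal sketch of one common pandas technique (not from the original post) is to sort the loaded table and set a MultiIndex on OptionRoot and DataDate, so each lookup becomes an index slice instead of a boolean mask over all ~6 million rows. The table, column names, row, and engine below come from the question; everything else is an assumption:

import pandas

# Load the options table once, as in the question (engine is the same SQLAlchemy engine).
options = pandas.read_sql('select * from options', engine)

# Sort and index by the two lookup columns so each access is an index slice
# rather than a full scan of ~6 million rows.
options = options.set_index(['OptionRoot', 'DataDate']).sort_index()

# Inside the subroutine: slice one OptionRoot over a DataDate range.
this_trade = options.loc[
    (row['OptionRoot'], row['DataDate']):(row['OptionRoot'], row['Expiration'])
].head(21)  # mirrors the "limit 21" of the original query

Whether this beats repeated small queries depends on how long the one-time load and sort take, so it is only worth trying if the table fits comfortably in memory.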
So the question is: Can I speed up my data reads somehow?
Upvotes: 0
Views: 65
Reputation: 5950
You could also include OptionRoot in that index, depending on how many distinct values it holds. If there are only a few (like a boolean), then it is better not to index such a column.
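For reference, here is a minimal sketch of creating such a composite index through the same SQLAlchemy engine used in the question; the table and column names come from the post, while the index name and the PostgreSQL-style quoted identifiers are assumptions:

from sqlalchemy import text

# Composite index so queries filtering on OptionRoot and a DataDate range
# can be answered from a single index.
with engine.begin() as conn:
    conn.execute(text(
        'create index options_optionroot_datadate_idx '
        'on options ("OptionRoot", "DataDate")'
    ))

Putting OptionRoot first matches the query, which uses an equality test on OptionRoot and a range condition on DataDate.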
A second option (not exclusive of the first) is to partition this table by DataDate or OptionRoot. This should give you a really big speed boost if your date range (or OptionRoot) covers only a small subset of your data, since determining which partition needs to be accessed is very fast, and you then work with a smaller set, for example 1m rows instead of 20m.
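As an illustration only, assuming a PostgreSQL 10+ database (the quoted identifiers in the question suggest PostgreSQL, but the post does not say) and made-up partition boundaries and column types, range partitioning by DataDate could look roughly like this:

from sqlalchemy import text

# Declarative range partitioning by DataDate (PostgreSQL 10+ syntax).
# Only a couple of columns are shown; the real table has ~20.
with engine.begin() as conn:
    conn.execute(text(
        'create table options_partitioned ('
        '  "OptionRoot" text, "DataDate" date'
        ') partition by range ("DataDate")'
    ))
    conn.execute(text(
        'create table options_2013 partition of options_partitioned '
        "for values from ('2013-01-01') to ('2014-01-01')"
    ))

Queries that filter on DataDate then only scan the partitions whose ranges overlap the requested dates.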
Upvotes: 1