jeffery_the_wind

Reputation: 18238

Program (Time) Bottleneck is Database Interaction

Recently I have been running some data analysis programs that are (or at least seem to be) very data intensive. I have a fairly large dataset: about 6 million rows and 20 columns. The program is written in Python, mostly using pandas and numpy, and performs many operations on the data.

I have improved the code along the way with help from the profiler. I created an index on the database table for the DataDate column, which increased the speed a lot, but the bottleneck in the code is still, by far, the pandas read_sql function.

My program often needs to access only small pieces of the whole dataset, for example 15 rows at a time. It seems to me there is a tradeoff between the size of each retrieval and the number of times the program retrieves data. In the profiler output below, you can see that the read_sql function took 761 out of the roughly 790 the whole program took to complete.

[profiler output screenshot]

The program has only one subroutine with a single line that calls read_sql:

this_trade = pandas.read_sql(
    """select * from options
       where "OptionRoot" = '%s' and "DataDate" between '%s' and '%s'
       order by "DataDate" asc limit 21;"""
    % (row['OptionRoot'], row['DataDate'], row['Expiration']),
    engine,
)
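As an aside, building the SQL with % string formatting breaks on values containing quotes and invites SQL injection; a safer variant passes bound parameters. A minimal sketch, assuming a PostgreSQL database reached through a SQLAlchemy engine (the connection string here is hypothetical):

import pandas
from sqlalchemy import create_engine, text

engine = create_engine("postgresql://user:password@localhost/mydb")  # hypothetical DSN

# Same query as above, but with bound parameters instead of % interpolation.
query = text("""
    select * from options
    where "OptionRoot" = :root
      and "DataDate" between :start and :end
    order by "DataDate" asc
    limit 21
""")

this_trade = pandas.read_sql(
    query,
    engine,
    params={
        "root": row["OptionRoot"],
        "start": row["DataDate"],
        "end": row["Expiration"],
    },
)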

I tried loading the whole options table once at the beginning into a pandas dataframe, and then accessing the data from the dataframe in the subroutine:

this_trade = options[(options['OptionRoot'] == row['OptionRoot']) & (options['DataDate'].between(row['DataDate'], row['Expiration']))]

This was going so slowly that I didn't even want to wait for the profiler output.
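One plausible reason the in-memory version crawls is that each boolean-mask lookup scans all ~6 million rows. Sorting the frame once on the lookup keys lets pandas binary-search instead of scanning. A sketch, assuming the column names above (lookup_trade is a hypothetical helper):

# One-time cost: index and sort by the lookup keys so .loc can binary-search.
options = options.set_index(['OptionRoot', 'DataDate']).sort_index()

def lookup_trade(root, start, end):
    # Slicing a sorted MultiIndex finds the bounds in O(log n),
    # versus a full scan for each boolean mask.
    return options.loc[(root, start):(root, end)].head(21)  # head(21) mimics "limit 21"

this_trade = lookup_trade(row['OptionRoot'], row['DataDate'], row['Expiration'])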

So the question is: Can I speed up my data reads somehow?

Upvotes: 0

Views: 65

Answers (1)

Łukasz Kamiński

Reputation: 5950

You could also include OptionRoot in that index, depending on how many distinct values it holds. If there are only a few (as with a boolean flag), it is better not to index that column.
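For example, assuming PostgreSQL and the SQLAlchemy engine from the question (the index name is made up), a composite index matching the query's filter could be created like this:

from sqlalchemy import text

# OptionRoot first (equality match), then DataDate (range condition),
# so one index can satisfy the whole WHERE clause and the ORDER BY.
with engine.begin() as conn:
    conn.execute(text(
        'CREATE INDEX ix_options_root_date ON options ("OptionRoot", "DataDate")'
    ))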

A second option (not mutually exclusive with the first) is to partition this table by DataDate or OptionRoot. This should give you a really big speed boost if your date range (or OptionRoot) covers only a small subset of your data: determining which partition needs to be accessed is very fast, and then you work with a much smaller set, for example 1 million rows instead of 20 million.
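As a sketch of what that could look like, assuming PostgreSQL 10+ declarative partitioning (the table and partition names are made up, and the column list is abbreviated):

from sqlalchemy import text

with engine.begin() as conn:
    # Parent table partitioned by date range; a query with a narrow
    # "DataDate" filter then touches only the matching partition.
    conn.execute(text('''
        CREATE TABLE options_part (
            "OptionRoot" text,
            "DataDate"   date
            -- remaining columns elided
        ) PARTITION BY RANGE ("DataDate")
    '''))
    conn.execute(text('''
        CREATE TABLE options_2014 PARTITION OF options_part
            FOR VALUES FROM ('2014-01-01') TO ('2015-01-01')
    '''))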

Upvotes: 1
