Reputation: 18238
Recently I have been running some data analysis programs that are (or at least seem to be) very data intensive. I have a fairly large dataset of about 6 million rows and 20 columns. The program is written in Python, using mostly pandas and numpy, and it performs many operations on the data.
I have already made improvements along the way after running the profiler. I created an index on the database table for the DataDate column, which increased the speed a lot, but the bottleneck in the code is still, by far, the pandas read_sql function.
My program often wants to access only small pieces of the whole dataset, say 15 rows at a time. It seems to me there is a tradeoff between the size of each retrieval and the number of times the program retrieves data. In the profiler screenshot below, you can see that the read_sql function took 761 while the whole program took about 790 to complete.
The program has only one subroutine, and only one line calls read_sql:
this_trade = pandas.read_sql("""select * from options where "OptionRoot" = '%s' and "DataDate" between '%s' and '%s' order by "DataDate" asc limit 21;""" % (row['OptionRoot'],row['DataDate'],row['Expiration']), engine)
I tried loading the whole options table once at the beginning into a pandas dataframe, and then accessing the data from the dataframe in the subroutine:
this_trade = options[ ( options['OptionRoot'] == row['OptionRoot'] ) & (options['DataDate'].between(row['DataDate'],row['Expiration']) ) ]
This was going so slowly that I didn't even want to wait for the profiler output.
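If the in-memory approach is revisited, a minimal sketch of one common pandas technique (not from the original post) is to sort the loaded table and set a MultiIndex on OptionRoot and DataDate, so each lookup becomes an index slice instead of a boolean mask over all ~6 million rows. The table, column names, row, and engine below come from the question; everything else is an assumption:

import pandas

# Load the options table once, as in the question (engine is the same SQLAlchemy engine).
options = pandas.read_sql('select * from options', engine)

# Sort and index by the two lookup columns so each access is an index slice
# rather than a full scan of ~6 million rows.
options = options.set_index(['OptionRoot', 'DataDate']).sort_index()

# Inside the subroutine: slice one OptionRoot over a DataDate range.
this_trade = options.loc[
    (row['OptionRoot'], row['DataDate']):(row['OptionRoot'], row['Expiration'])
].head(21)  # mirrors the "limit 21" of the original query

Whether this beats repeated small queries depends on how long the one-time load and sort take, so it is only worth trying if the table fits comfortably in memory.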
So the question is: Can I speed up my data reads somehow?
Upvotes: 0
Views: 65
Reputation: 5950
You could also include OptionRoot in that index, depending on how many distinct values it holds. If there are only a few (like a boolean), then it is better not to index such a column.
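For reference, here is a minimal sketch of creating such a composite index through the same SQLAlchemy engine used in the question; the table and column names come from the post, while the index name and the PostgreSQL-style quoted identifiers are assumptions:

from sqlalchemy import text

# Composite index so queries filtering on OptionRoot and a DataDate range
# can be answered from a single index.
with engine.begin() as conn:
    conn.execute(text(
        'create index options_optionroot_datadate_idx '
        'on options ("OptionRoot", "DataDate")'
    ))

Putting OptionRoot first matches the query, which uses an equality test on OptionRoot and a range condition on DataDate.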
A second option (not exclusive of the first) is to partition this table by DataDate or OptionRoot. This should give you a really big speed boost if your date range (or OptionRoot) covers only a small subset of your data, since determining which partition needs to be accessed is very fast, and you then work with a smaller set, for example 1m rows instead of 20m.
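As an illustration only, assuming a PostgreSQL 10+ database (the quoted identifiers in the question suggest PostgreSQL, but the post does not say) and made-up partition boundaries and column types, range partitioning by DataDate could look roughly like this:

from sqlalchemy import text

# Declarative range partitioning by DataDate (PostgreSQL 10+ syntax).
# Only a couple of columns are shown; the real table has ~20.
with engine.begin() as conn:
    conn.execute(text(
        'create table options_partitioned ('
        '  "OptionRoot" text, "DataDate" date'
        ') partition by range ("DataDate")'
    ))
    conn.execute(text(
        'create table options_2013 partition of options_partitioned '
        "for values from ('2013-01-01') to ('2014-01-01')"
    ))

Queries that filter on DataDate then only scan the partitions whose ranges overlap the requested dates.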
Upvotes: 1