Lorenzo Bassetti

Reputation: 945

Fastest approach to read and process 10k Excel cells in Python/Pandas?

I want to read and process realtime DDE data from a trading platform, using Excel as a 'bridge' between the trading platform (which sends out data) and Python, which processes it and prints the results back to Excel as a front-end 'GUI'. SPEED IS CRUCIAL.

I import the xlwings library and use it to read data from one sheet, calculate the needed values in Python, and then print the results to another sheet of the same file. I want Excel to stay open and visible so that it functions as an 'output dashboard'. This function runs in an infinite loop, reading realtime stock prices.

import xlwings as xw
import numpy as np
import pandas as pd

...
...

tickdf = pd.DataFrame(xw.Book('datafile.xlsx').sheets['raw_data'].range((1, 5), (1500, 8)).value)
tickdf.columns = ['time', 'price', 'all-tick','symb']
tickdf = tickdf[['time','symb', 'price', 'all-tick']]
#read data and fill a pandas.df with values, then re-order columns

try:
   global ttt #this is used as temporary global pandas.df
   global tttout #this is used as output global pandas.df copy
   #they are global as they can be zeroed with another function

   ttt = ttt.append(tickdf, ignore_index=False)
   #at each loop, newly read ticks are appended as rows to the end of the global ttt DataFrame
   #(note: DataFrame.append was removed in pandas 2.0; pd.concat([ttt, tickdf]) is the modern equivalent)

   ttt.drop_duplicates(inplace=True)

   tttout = ttt.copy()
   #for extra safety, print out a copy of ttt so that incomplete data is never written to the Excel file

   tttout = tttout.groupby(['time','symb'], as_index=False).agg({'all-tick':'sum', 'price':'first'})
   tttout = tttout.set_index('time')
   #aggregate per time/symbol, then set time as index

   tttout = tttout.loc[tttout['all-tick'].isin(target_ticker)] 
   #find matching values comparing an array of a dozen values

   tttout = tttout.sort_values(by = ['time', 'symb'], ascending = [False, True])
   xw.Book(file_path).sheets['OUTPUT'].range('B2').value = tttout
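To see where the time goes, the pandas portion of the loop can be exercised in isolation from Excel. Below is a minimal sketch of the same dedupe/aggregate/filter/sort steps with made-up ticks (the sample data and target quantities are illustrative, not from the original):

```python
import pandas as pd

# Hypothetical sample ticks shaped like the question's data
tickdf = pd.DataFrame(
    [['10:00:04', 'ABC', 10.33, 50000],
     ['10:00:04', 'ABC', 10.33, 50000],   # exact duplicate, dropped below
     ['09:45:20', 'XYZ', 5.26, 200],
     ['10:00:04', 'XYZ', 5.26, 200]],
    columns=['time', 'symb', 'price', 'all-tick'])

target_ticker = [200, 50000, 1000000]  # quantities of interest (illustrative)

# Same steps as the loop body: dedupe, aggregate, filter, sort
ttt = tickdf.drop_duplicates()
out = (ttt.groupby(['time', 'symb'], as_index=False)
          .agg({'all-tick': 'sum', 'price': 'first'}))
out = out[out['all-tick'].isin(target_ticker)]
out = out.sort_values(['time', 'symb'], ascending=[False, True])
print(out)
```

Timing this fragment alone, against the full loop, shows how much of the 500-600 ms is pandas work versus xlwings/COM transfer.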

I run this on an [email protected], and this function, together with some other small pieces of code, runs in 500-600 ms per loop, which is fairly good (but not fantastic!). I would like to know whether there is a better approach, and which step(s) might be the bottlenecks.
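One way to locate the bottleneck is to time each stage of the loop separately and print a breakdown. A minimal sketch using the standard library (the helper name and stand-in workload are illustrative):

```python
import time

timings = {}

def timed(label, fn, *args, **kwargs):
    """Run fn, record its elapsed wall time under label, and return its result."""
    t0 = time.perf_counter()
    result = fn(*args, **kwargs)
    timings[label] = time.perf_counter() - t0
    return result

# Example usage with a stand-in workload; in the real loop each stage
# (Excel read, append, dedupe, groupby, isin, Excel write) would be wrapped.
data = timed('build', list, range(1_000_000))
total = timed('sum', sum, data)

for label, secs in sorted(timings.items(), key=lambda kv: -kv[1]):
    print(f'{label:>10}: {secs * 1000:.1f} ms')
```

In loops dominated by COM round-trips, the Excel read/write stages typically dwarf the pandas stages, which this breakdown would make visible.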

The code reads 1500 rows, one per listed stock in alphabetical order; each row is the 'last tick' traded on the market for that specific stock, and it looks like this:

'10:00:04 | ABC | 10.33 | 50000'
'09:45:20 | XYZ | 5.260 | 200 '
...

i.e. time, stock symbol, price, and quantity.

I want to investigate whether some specific quantities are traded on the market, such as 1,000,000 (as it represents a huge order), or maybe just '1', which is often used as a market 'heartbeat', a sort of fake order.

My approach is to use pandas, xlwings, and the 'isin' method. Is there a more efficient approach that might improve my script's performance?
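On the filtering step specifically, when only the quantity column matters, staying in NumPy can be cheaper than the DataFrame round-trip, since `np.isin` runs as a vectorised operation on a plain array. A hedged sketch (the sample values are illustrative):

```python
import numpy as np

# Quantities column as a plain integer array (illustrative data)
quantities = np.array([50000, 200, 1, 1000000, 300, 1])
target_ticker = np.array([1, 1000000])  # 'heartbeat' and huge-order sizes

mask = np.isin(quantities, target_ticker)  # boolean mask over all rows
matches = quantities[mask]                 # matching quantities: 1, 1000000, 1
```

The same boolean mask can then index back into the full DataFrame (e.g. `tttout[mask]`) to recover the time/symbol columns for the matching rows.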

Upvotes: 0

Views: 1254

Answers (2)

Lorenzo Bassetti

Reputation: 945

@Tony Roberts, thank you

I have one doubt and one observation.

DOUBT: The data gets updated very fast, every 50-100 ms. Would it be feasible to call a UDF function that often? Would it be lean? I have little experience with this.

OBSERVATION: PyXLL is for sure extremely powerful, well done, and well maintained, but IMHO, at $25/month it goes beyond the free nature of the Python language. Although, I do understand that quality has a price.

Upvotes: 0

Tony Roberts

Reputation: 417

It would be faster to use a UDF written with PyXLL as that would avoid going via COM and an external process. You would have a formula in Excel with the input set to your range of data, and that would be called each time the input data updated. This would avoid the need to keep polling the data in an infinite loop, and should be much faster than running Python outside of Excel.

See https://www.pyxll.com/docs/introduction.html if you're not already familiar with PyXLL.

PyXLL could convert the input range to a pandas DataFrame for you (see https://www.pyxll.com/docs/userguide/pandas.html), but that might not be the fastest way to do it.

The quickest way to transfer data from Excel to Python is via a floating point numpy array using the "numpy_array" type in PyXLL (see https://www.pyxll.com/docs/userguide/udfs/argtypes.html#numpy-array-types).

As speed is a concern, maybe you could split the data up and have some functions that take mostly static data (eg rows and column headers), and other functions that take variable data as numpy_arrays where possible or other types where not, and then a final function to combine them all.

PyXLL can return Python objects to Excel as object handles. If you need to return intermediate results then it is generally faster to do that instead of expanding the whole dataset to an Excel range.

Upvotes: 1

Related Questions