Reputation: 572
As part of a bigger set of tests I need to extract all the formulas within an uploaded Excel workbook. I then need to parse each formula into its respective range references and dump those references into a simple database. For example, if Cell A1 has a formula =B1 + C1 then my database would record B1 and C1 as referenced cells.
Currently I read formulas one at a time using openpyxl and then parse them. This is fine for smaller workbooks, but for large workbooks it can be very slow. It feels entirely inefficient.
Could pandas or a similar module extract Excel formulas faster? Or is there perhaps a better way to extract all workbook formulas than reading it one cell at a time?
Any advice would be highly appreciated.
Upvotes: 2
Views: 1557
Reputation: 19497
What do you mean by "extracting the formulae faster"? They are stored with each cell so you have to go cell by cell. When it comes to parsing, openpyxl includes a tokeniser which you might find useful. In theory this would allow you to read the worksheet XML files directly and only parse the nodes with formulae in them. However, you'd also have to handle the "shared formulae" that some applications use. openpyxl automatically converts such formulae into per-cell ones.
Internally Pandas relies on xlrd to read the files, so the ETL of getting the stuff into Pandas won't be faster than working directly with worksheet objects.
Upvotes: 2
Reputation: 23505
Don't know about python, but a fast approach to the problem is:
Upvotes: 0