artifex_knowledge
artifex_knowledge

Reputation: 572

Fastest way to parse all Excel formulas using Python 3.5

As part of a bigger set of tests I need to extract all the formulas within an uploaded Excel workbook. I then need to parse each formula into its respective range references and dump those references into a simple database. For example, if Cell A1 has a formula =B1 + C1 then my database would record B1 and C1 as referenced cells.

Currently I read formulas one at a time using openpyxl and then parse them. This is fine for smaller workbooks, but for large workbooks it can be very slow. It feels entirely inefficient.

Could pandas or a similar module extract Excel formulas faster? Or is there perhaps a better way to extract all workbook formulas than reading it one cell at a time?

Any advice would be highly appreciated.

Upvotes: 2

Views: 1557

Answers (2)

Charlie Clark
Charlie Clark

Reputation: 19497

What do you mean by "extracting the formulae faster"? They are stored with each cell so you have to go cell by cell. When it comes to parsing, openpyxl includes a tokeniser which you might find useful. In theory this would allow you to read the worksheet XML files directly and only parse the nodes with formulae in them. However, you'd also have to handle the "shared formulae" that some applications use. openpyxl automatically converts such formulae into per-cell ones.

Internally Pandas relies on xlrd to read the files, so the ETL of getting the stuff into Pandas won't be faster than working directly with worksheet objects.

Upvotes: 2

Charles Williams
Charles Williams

Reputation: 23505

Don't know about python, but a fast approach to the problem is:

  • get all the formulas in R1C1 mode into an array using specialcells
  • feed into a collection/dictionary to get uniques
  • then parse the uniques

Upvotes: 0

Related Questions