Musslan
Musslan

Reputation: 111

How to optimise an Excel workbook with Apache POI

For a hobby project I'm working on, I want to use Apache POI to update, evaluate and retrieve the results from Excel files. Often, only a few cells are relevant to do this and the rest of cells just take up a lot of space, leading to performance issues when opening the files. Let's say I have the following structure

Now, I want to be able to update B2:B10 and C2:C10 and get the updated value for A1. I don't care about the other data. To optimise the size of the file, I would want to update all other cells to null and save the file. E.g., in the example above: remove column D, E and F from sheet "First" + remove sheet "Second".

One way to accomplish this would be to implement something like the following pseudocode (assuming no circular references)

Take set of cells I want to evaluate (e.g., A1) as "relevantCells"
For each cell in "relevantCells"
  Retrive all cells in the formulae and add to "relevantCells"
For each cell in workbook
  Check if cell is in "relevantCells"
  if(yes) continue
  else set cell to null
Save workbook

Is there a better way of having Apache POI automatically remove all cells that are not relevant in order to retrieve updated values for a set of cells?

Upvotes: 0

Views: 886

Answers (1)

centic
centic

Reputation: 15880

Take a look at the APIs for event-based reading of files, this way you could read the file in a streaming fashion (i.e. no OOM even on very large files) and only write out the "interesting parts" into a second Workbook, that you build up as you go along.

Upvotes: 1

Related Questions