Reputation: 69
I'm working with Apache Poi XSSFWorkbooks to manipulate xlsx files; my program works fine on small excel files (60 000 rows). When I started to test my code on a big file (700 000 rows) I had a memory problem. I test my code on a computer with 16 GB of RAM and it doesn't work.
Any help with this issue? I read about SAX parser but I don't want to modify my code, moreover I don't find it intuitive to use; it's not simple as xssf which have simple methods to get cells,rows..etc
Is there a way to keep my code as it is and solve the memory problem? Or any solutions apart from SAX parser? Any help is appreciated, thanks.
Upvotes: 1
Views: 5319
Reputation: 4506
From experience, SAX really helps a lot with memory performance. Went from 4GB+ to around 300MB.
Some useful links and other tips:
From https://poi.apache.org/spreadsheet/limitations.html
File sizes/Memory usage
There are some inherent limits in the Excel file formats. These are defined in class SpreadsheetVersion. As long as you have enough main-memory, you should be able to handle files up to these limits. For huge files using the default POI classes you will likely need a very large amount of memory.
There are ways to overcome the main-memory limitations if needed: For writing very huge files, there is SXSSFWorkbook which allows to do a streaming write of data out to files (with certain limitations on what you can do as only parts of the file are held in memory). For reading very huge files, take a look at the sample XLSX2CSV which shows how you can read a file in streaming fashion (again with some limitations on what information you can read out of the file, but there are ways to get at most of it if necessary).
Also
https://poi.apache.org/faq.html#faq-N10165
- I think POI is using too much memory! What can I do? This one comes up quite a lot, but often the reason isn't what you might initially think. So, the first thing to check is - what's the source of the problem? Your file? Your code? Your environment? Or Apache POI?
(If you're here, you probably think it's Apache POI. However, it often isn't! A moderate laptop, with a decent but not excessive heap size, from a standing start, can normally read or write a file with 100 columns and 100,000 rows in under a couple of seconds, including the time to start the JVM).
Apache POI ships with a few programs and a few example programs, which can be used to do some basic performance checks. For testing file generation, the class to use is in the examples package, SSPerformanceTest (viewvc). Run SSPerformanceTest with arguments of the writing type (HSSF, XSSF or SXSSF), the number rows, the number of columns, and if the file should be saved. If you can't run that with 50,000 rows and 50 columns in HSSF and SXSSF in under 3 seconds, and XSSF in under 10 seconds (and ideally all 3 in less than that!), then the problem is with your environment.
Next, use the example program ToCSV (viewvc) to try reading the a file in with HSSF or XSSF. Related is XLSX2CSV (viewvc), which uses SAX parsing for .xlsx. Run this against both your problem file, and a simple one generated by SSPerformanceTest of the same size. If this is slow, then there could be an Apache POI problem with how the file is being processed (POI makes some assumptions that might not always be right on all files). If these tests are fast, then any performance problems are in your code!
And
Files vs InputStreams http://poi.apache.org/spreadsheet/quick-guide.html#FileInputStream
When opening a workbook, either a .xls HSSFWorkbook, or a .xlsx XSSFWorkbook, the Workbook can be loaded from either a File or an InputStream. Using a File object allows for lower memory consumption, while an InputStream requires more memory as it has to buffer the whole file.
If using WorkbookFactory, it's very easy to use one or the other:
// Use a file
Workbook wb = WorkbookFactory.create(new File("MyExcel.xls"));
// Use an InputStream, needs more memory
Workbook wb = WorkbookFactory.create(new FileInputStream("MyExcel.xlsx"));
If using HSSFWorkbook or XSSFWorkbook directly, you should generally go through NPOIFSFileSystem or OPCPackage, to have full control of the lifecycle (including closing the file when done):
// HSSFWorkbook, File
NPOIFSFileSystem fs = new NPOIFSFileSystem(new File("file.xls"));
HSSFWorkbook wb = new HSSFWorkbook(fs.getRoot(), true);
....
fs.close();
// HSSFWorkbook, InputStream, needs more memory
NPOIFSFileSystem fs = new NPOIFSFileSystem(myInputStream);
HSSFWorkbook wb = new HSSFWorkbook(fs.getRoot(), true);
// XSSFWorkbook, File
OPCPackage pkg = OPCPackage.open(new File("file.xlsx"));
XSSFWorkbook wb = new XSSFWorkbook(pkg);
....
pkg.close();
// XSSFWorkbook, InputStream, needs more memory
OPCPackage pkg = OPCPackage.open(myInputStream);
XSSFWorkbook wb = new XSSFWorkbook(pkg);
....
pkg.close();
Upvotes: 5