Hana

Reputation: 824

Writing large amounts of data to Excel: GC overhead limit exceeded

I have a list of strings read from MongoDB (~200k lines). I want to write it to an Excel file with this Java code:

import java.io.FileOutputStream;
import java.io.IOException;
import java.util.List;

import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.xssf.usermodel.XSSFSheet;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;

public class OutputToExcelUtils {

    private static XSSFWorkbook workbook;
    private static final String DATA_SEPARATOR = "!";

    public static void clusterOutToExcel(List<String> data, String outputPath) {

        workbook = new XSSFWorkbook();
        FileOutputStream outputStream = null;

        writeData(data, "Data");


        try {
            outputStream = new FileOutputStream(outputPath);            
            workbook.write(outputStream);
            workbook.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public static void writeData(List<String> data, String sheetName) {

        int rowNum = 0;
        XSSFSheet sheet = workbook.createSheet(sheetName);


        for (int i = 0; i < data.size(); i++) {
            System.out.println(sheetName + " Processing line: " + i);
            int colNum = 0;
            // Split the line into individual cell values
            String[] valuesOfLine = data.get(i).split(DATA_SEPARATOR);

            Row row = sheet.createRow(rowNum++);

            for (String valueOfCell : valuesOfLine) {
                Cell cell = row.createCell(colNum++);
                cell.setCellValue(valueOfCell);
            }
        }
    }

}

Then I get an error:

Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
    at org.apache.xmlbeans.impl.store.Cur$Locations.<init>(Cur.java:497)
    at org.apache.xmlbeans.impl.store.Locale.<init>(Locale.java:168)
    at org.apache.xmlbeans.impl.store.Locale.getLocale(Locale.java:242)
    at org.apache.xmlbeans.impl.store.Locale.newInstance(Locale.java:593)
    at org.apache.xmlbeans.impl.schema.SchemaTypeLoaderBase.newInstance(SchemaTypeLoaderBase.java:198)
    at org.apache.poi.POIXMLTypeLoader.newInstance(POIXMLTypeLoader.java:132)
    at org.openxmlformats.schemas.spreadsheetml.x2006.main.CTRst$Factory.newInstance(Unknown Source)
    at org.apache.poi.xssf.usermodel.XSSFRichTextString.<init>(XSSFRichTextString.java:87)
    at org.apache.poi.xssf.usermodel.XSSFCell.setCellValue(XSSFCell.java:417)
    at ups.mongo.excelutil.OutputToExcelUtils.writeData(OutputToExcelUtils.java:80)
    at ups.mongo.excelutil.OutputToExcelUtils.clusterOutToExcel(OutputToExcelUtils.java:30)
    at ups.mongodb.App.main(App.java:74)

Could you please give me some advice on this?

Thank you, with respect.

Update (solution): using SXSSFWorkbook instead of XSSFWorkbook

public class OutputToExcelUtils {

    private static SXSSFWorkbook workbook;
    private static final String DATA_SEPERATOR = "!";

    public static void clusterOutToExcel(ClusterOutput clusterObject, ClusterOutputTrade clusterOutputTrade,
            ClusterOutputDistance ClusterOutputDistance, String outputPath) {

        workbook = new SXSSFWorkbook();
        workbook.setCompressTempFiles(true);
        FileOutputStream outputStream = null;

        writeData(clusterOutputTrade.getTrades(), "Data");

        try {
            outputStream = new FileOutputStream(outputPath);            
            workbook.write(outputStream);
            workbook.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public static void writeData(List<String> data, String sheetName) {

        int rowNum = 0;
        SXSSFSheet sheet = workbook.createSheet(sheetName);
        sheet.setRandomAccessWindowSize(100); // Keep 100 rows in memory; older rows are flushed to a temp file once written

        for (int i = 0; i < data.size(); i++) {
            System.out.println(sheetName + " Processing line: " + i);
            int colNum = 0;
            // Split the line into individual cell values
            String[] valuesOfLine = data.get(i).split(DATA_SEPERATOR);

            Row row = sheet.createRow(rowNum++);

            for (String valueOfCell : valuesOfLine) {
                Cell cell = row.createCell(colNum++);
                cell.setCellValue(valueOfCell);
            }
        }
    }

}
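One caveat worth adding (my note, not part of the original code): SXSSF backs each sheet with temporary files, and those files are only deleted when workbook.dispose() is called. A sketch of the write-and-clean-up step with that added:

    try (FileOutputStream outputStream = new FileOutputStream(outputPath)) {
        workbook.write(outputStream);
        workbook.close();
    } catch (IOException e) {
        e.printStackTrace();
    } finally {
        workbook.dispose(); // deletes the temporary files backing the SXSSF sheets
    }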

Upvotes: 3

Views: 5686

Answers (3)

Tony the Pony

Reputation: 41347

Your application is spending too much time doing garbage collection. This doesn't necessarily mean that it is running out of heap space; however, it spends too much time in GC relative to performing actual work, so the Java runtime shuts it down.

Try enabling the throughput collector with the following JVM option:

-XX:+UseParallelGC

While you're at it, give your application as much heap space as possible:

-Xmx????m

(where ???? stands for the amount of heap space in MB, e.g. -Xmx8192m)

If this doesn't help, try to set a more lenient throughput goal with this option:

-XX:GCTimeRatio=19 

This specifies that your application should do 19 times more useful work than GC-related work, i.e. it allows the GC to consume up to 5% of the processor time (I believe the stricter default goal of 1% may be what causes the runtime error above).
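Put together, the invocation would look something like this (the heap size and yourapp.jar are placeholders):

    java -XX:+UseParallelGC -Xmx8192m -XX:GCTimeRatio=19 -jar yourapp.jar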

No guarantee that this will work. Can you check and post back, so others who experience similar problems may benefit?

EDIT

Your root problem remains the fact that you need to hold the entire spreadsheet and all its related objects in memory while you are building it. Another solution would be to serialize the data, i.e. write the actual spreadsheet file as you go instead of constructing it in memory and saving it at the end. However, this requires reading up on the XLSX format and creating a custom solution.

Another option would be to look for a less memory-intensive library (if one exists). Possible alternatives to POI are JExcelAPI (open source) and Aspose.Cells (commercial).

I used JExcelAPI years ago and had a positive experience (however, it appears to be much less actively maintained than POI, so it may no longer be the best choice).

EDIT 2

It looks like POI offers a streaming model (https://poi.apache.org/spreadsheet/how-to.html#sxssf), so this may be the best overall approach.
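The heart of that streaming API is a sliding row window; a minimal sketch (the window size of 100 is only an example value):

    // Keep at most 100 rows per sheet in memory; older rows are flushed to a temp file
    SXSSFWorkbook wb = new SXSSFWorkbook(100);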

Upvotes: 4

Joop Eggen

Reputation: 109532

Instead of holding the entire list in memory, iterate over the data line by line. If that is too cumbersome, write the list to a file and re-read it line-wise, for instance as a Stream<String>:

    Path path = Files.createTempFile(...);
    Files.write(path, list, StandardCharsets.UTF_8);
    Files.lines(path, StandardCharsets.UTF_8)
        .forEach(line -> { ... });

On the Excel side: although xlsx uses shared strings, in case XSSF handles this carelessly, the following would make repeated string values share a single String instance:

import java.util.HashMap;
import java.util.Map;

public class StringCache {
    private static final int MAX_LENGTH = 40;
    private final Map<String, String> identityMap = new HashMap<>();

    public String cached(String s) {
        if (s == null) {
            return null;
        }
        if (s.length() > MAX_LENGTH) {
            // Long strings are unlikely to repeat; don't hold them in the cache
            return s;
        }
        String t = identityMap.get(s);
        if (t == null) {
            t = s;
            identityMap.put(t, t);
        }
        return t;
    }
}

StringCache strings = new StringCache();

    for (String valueOfCell : valuesOfLine) {
        Cell cell = row.createCell(colNum++);
        cell.setCellValue(strings.cached(valueOfCell));
    }
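As a side note (Java 8+, my suggestion rather than part of the original answer): the get/put pair in cached() could be collapsed into a single call, identityMap.computeIfAbsent(s, k -> k).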

Upvotes: 0

Alexander Petrov

Reputation: 9482

Well, try not to load all the data into memory. Even if the binary representation of 200k lines is not that big, the hydrated objects in memory may be too big. Just as a hint: if your data is a POJO, each attribute of that POJO is reached through a pointer, and each pointer takes 4 or 8 bytes depending on whether compressed oops are in use. This means that if your data is a POJO with 4 attributes, the pointers alone will cost you 200,000 * 4 * 4 bytes (or 8 bytes).
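To make that concrete: 200,000 rows * 4 reference fields * 8 bytes comes to roughly 6.4 MB for the references alone, before counting object headers and the character data behind each String.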

Theoretically you can increase the amount of memory available to the JVM, but this is not a good solution, or more precisely, it is not a good solution for a live system. For a non-interactive system it might be fine.

Hint: use the -Xmx and -Xms JVM arguments to control the heap size.

Upvotes: 1
