Rajib_Podder
Rajib_Podder

Reputation: 91

Read Huge Excel file(500K rows) in java

I am trying to read a Big XLSX File. The Excel file has around 500K rows.I need to read col 2.

OPCPackage pkg;
pkg = OPCPackage.open("File path");
XSSFWorkbook myWorkBook = new XSSFWorkbook(pkg);
Sheet sheet = myWorkBook.getSheetAt(2);
Iterator<Row> rowIterator = sheet.iterator();
while (rowIterator.hasNext())
{
Row row = rowIterator.next();
if (row_num > ROW_ESCAPE) 
{
   Cell cell = row.getCell(2);
  if (!cell.getStringCellValue().toString().trim().isEmpty()) 
            {
                System.out.println(cell.getStringCellValue().toString());
            }
System.out.println("hi"+row_num);
        }
        row_num++;
 }

It prints till row 39723 After that it throws the below exception

Exception in thread "AWT-EventQueue-0" java.lang.OutOfMemoryError: Java heap space
at java.util.regex.Matcher.<init>(Matcher.java:225)
at java.util.regex.Pattern.matcher(Pattern.java:1093)
at org.apache.poi.xssf.usermodel.XSSFRichTextString.utfDecode(XSSFRichTextString.java:482)
at org.apache.poi.xssf.usermodel.XSSFRichTextString.getString(XSSFRichTextString.java:297)
at org.apache.poi.xssf.usermodel.XSSFCell.getStringCellValue(XSSFCell.java:262)
at Main.get_titles(Main.java:484)
at Main.analyze_Importsheet(Main.java:461)
at Main.but_sel_imp_sheetActionPerformed(Main.java:220)
at Main.access$000(Main.java:40)
at Main$1.actionPerformed(Main.java:85)
at javax.swing.AbstractButton.fireActionPerformed(AbstractButton.java:2022)
at javax.swing.AbstractButton$Handler.actionPerformed(AbstractButton.java:2348)
at javax.swing.DefaultButtonModel.fireActionPerformed(DefaultButtonModel.java:402)
at javax.swing.DefaultButtonModel.setPressed(DefaultButtonModel.java:259)
at javax.swing.plaf.basic.BasicButtonListener.mouseReleased(BasicButtonListener.java:252)
at java.awt.Component.processMouseEvent(Component.java:6533)
at javax.swing.JComponent.processMouseEvent(JComponent.java:3324)
at java.awt.Component.processEvent(Component.java:6298)
at java.awt.Container.processEvent(Container.java:2236)
at java.awt.Component.dispatchEventImpl(Component.java:4889)
at java.awt.Container.dispatchEventImpl(Container.java:2294)
at java.awt.Component.dispatchEvent(Component.java:4711)
at java.awt.LightweightDispatcher.retargetMouseEvent(Container.java:4888)
at java.awt.LightweightDispatcher.processMouseEvent(Container.java:4525)
at java.awt.LightweightDispatcher.dispatchEvent(Container.java:4466)
at java.awt.Container.dispatchEventImpl(Container.java:2280)
at java.awt.Window.dispatchEventImpl(Window.java:2746)
at java.awt.Component.dispatchEvent(Component.java:4711)
at java.awt.EventQueue.dispatchEventImpl(EventQueue.java:758)
at java.awt.EventQueue.access$500(EventQueue.java:97)
at java.awt.EventQueue$3.run(EventQueue.java:709)
at java.awt.EventQueue$3.run(EventQueue.java:703)

Main.java:484=if (!cell.getStringCellValue().toString().trim().isEmpty()) If i remove that line and just print the row number, it works fine. I need help how to get the string value for col 2.

Upvotes: 9

Views: 24090

Answers (5)

N.Neupane
N.Neupane

Reputation: 291

This library is available from from Maven Central, and you can optionally install it yourself.

<!-- POI for parsing Excel files-->
    <dependency>
        <groupId>org.apache.poi</groupId>
        <artifactId>poi</artifactId>
        <version>4.1.2</version>
    </dependency>

    <!-- POI-ooxml -->
    <dependency>
        <groupId>org.apache.poi</groupId>
        <artifactId>poi-ooxml</artifactId>
        <version>4.1.2</version>
    </dependency>

    <!-- For reading very large Excel file -->
    <dependency>
        <groupId>com.monitorjbl</groupId>
        <artifactId>xlsx-streamer</artifactId>
        <version>2.1.0</version>
    </dependency>

To use it, add this to your POM: Just paste the code with the class name ReadLargeFile.java, and see the magic,

import code.axis.properties.ConfigReader;
import com.monitorjbl.xlsx.StreamingReader;
import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.ss.util.NumberToTextConverter;

import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.regex.Pattern;

public class ReadLargeFile {
     public static void main(String[] args) {
       try (InputStream inputStream = new FileInputStream(new File("C:/Users/Nischal/Desktop/Qualtiy Assurance of Data Clener/Extra Large Files/update_fileName01-26-2021-6-34-49.XLSX"))) { //FilePath from your device
        Workbook workbook = StreamingReader.builder().rowCacheSize(200).bufferSize(4096).open(inputStream);
        for (Sheet sheet : workbook) {
            for (Row row : sheet) {
                for (Cell cell : row) {
                    String cellValue = getStringCellValue(cell);
                    System.out.println(cellValue);
                }
            }
        }

    } catch (Exception e) {
        e.printStackTrace();
    }
}

private static String getStringCellValue(Cell cell) {
    try {
        switch (cell.getCellType()) {
            case FORMULA:
                try {
                    return NumberToTextConverter.toText(cell.getNumericCellValue());
                } catch (NumberFormatException e) {
                    return cell.getStringCellValue();
                }
            case NUMERIC:
                return NumberToTextConverter.toText(cell.getNumericCellValue());
            case STRING:
                String cellValue = cell.getStringCellValue().trim();
                String pattern = "\\^\\$?-?([1-9][0-9]{0,2}(,\\d{3})*(\\.\\d{0,2})?|[1-9]\\d*(\\.\\d{0,2})?|0(\\.\\d{0,2})?|(\\.\\d{1,2}))$|^-?\\$?([1-9]\\d{0,2}(,\\d{3})*(\\.\\d{0,2})?|[1-9]\\d*(\\.\\d{0,2})?|0(\\.\\d{0,2})?|(\\.\\d{1,2}))$|^\\(\\$?([1-9]\\d{0,2}(,\\d{3})*(\\.\\d{0,2})?|[1-9]\\d*(\\.\\d{0,2})?|0(\\.\\d{0,2})?|(\\.\\d{1,2}))\\)$";
                if (((Pattern.compile(pattern)).matcher(cellValue)).find()) {
                    return cellValue.replaceAll("[^\\d.]", "");
                }
                return cellValue.trim();
            case BOOLEAN:
                return String.valueOf(cell.getBooleanCellValue());
            case ERROR:
                return null;
            default:
                return cell.getStringCellValue();
        }
    } catch (Exception e) {
        if (e.getLocalizedMessage() != null && ConfigReader.isDisplayWarnLog())
            return "";
    }
    return "";
}
}

The code prints the cell value one by one no matter how large the excel file is.

Upvotes: 0

jiang_anwei
jiang_anwei

Reputation: 19

you need see this https://github.com/monitorjbl/excel-streaming-reader

you can code like this

InputStream is = new FileInputStream(new File("/path/to/workbook.xlsx"));
Workbook workbook = StreamingReader.builder()
        .rowCacheSize(100)    // number of rows to keep in memory (defaults to 10)
        .bufferSize(4096)     // buffer size to use when reading InputStream to file (defaults to 1024)
        .open(is);  

      // InputStream or File for XLSX file (required)

Upvotes: 1

Andreas
Andreas

Reputation: 159076

Use the streaming eventmodel API, not the in-memory usermodel API.

See Apache POI: How to use the HSSF Event API.

Upvotes: 5

subwaymatch
subwaymatch

Reputation: 1040

The easiest way (without chaning your read logic) would be to increase heap size.

If that's not a viable option for you, use stream. Actually, there is a convenient library already available.

https://github.com/monitorjbl/excel-streaming-reader

Upvotes: 1

StvnBrkdll
StvnBrkdll

Reputation: 4044

Increasing the JVM's heap size will likely fix your OutOfMemoryError. See this stackoverflow post for how to increase the JVM's heap size.

Upvotes: 0

Related Questions