StarFish

Reputation: 77

How to parse a huge CSV file efficiently in Java

My application currently uses a CSV parser to parse CSV files and persist them to a database. It loads the entire CSV into memory and takes a long time to persist, sometimes even timing out. I have seen mixed recommendations on this site to use the univocity parser. Please advise on the best approach to process large amounts of data in less time.
Thank you.

Code:

    int numRecords = csvParser.parse(new ByteArrayInputStream(fileBytes)); // wrap the byte[] so it matches parse(InputStream)

    public int parse(InputStream ins) throws ParserException {
        long parseTime = System.currentTimeMillis();
        fireParsingBegin();
        ParserEngine engine = null;
        try {
            engine = (ParserEngine) getEngineClass().newInstance();
        } catch (Exception e) {
            throw new ParserException(e.getMessage());
        }
        engine.setInputStream(ins);
        engine.start();
        int count = parse(engine);
        fireParsingDone();
        long seconds = (System.currentTimeMillis() - parseTime) / 1000;
        System.out.println("Time taken is " + seconds + " seconds");
        return count;
    }


    protected int parse(ParserEngine engine) throws ParserException {
        int count = 0;
        while (engine.next()) { // each call populates the engine's values array with one row's cell data
            if (stopParsing) {
                break;
            }

            Object o = parseObject(engine); // create an individual transfer object (TO)
            if (o != null) {
                count++; // count is increased after every TO is formed
                fireObjectParsed(o, engine); // add it to the BO/collection and prepare validations
            } else {
                return count;
            }
        }
        return count;
    }
Upvotes: 6

Views: 11657

Answers (3)

Sergey Nemchinov

Reputation: 1616

Stream the file with Apache Commons IO's LineIterator, which reads one line at a time instead of loading the whole file:

import java.io.File;
import org.apache.commons.io.FileUtils;
import org.apache.commons.io.LineIterator;

// try-with-resources works because LineIterator implements Closeable (commons-io 2.6+)
try (LineIterator it = FileUtils.lineIterator(theFile, "UTF-8")) {
    while (it.hasNext()) {
        String line = it.nextLine();
        // do something with line
    }
}
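This keeps memory use flat regardless of file size. One caveat: iterating over raw lines breaks on quoted CSV fields that contain embedded newlines, so for such data feed the stream to a real CSV parser rather than splitting lines yourself.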

Upvotes: 0

Jeronimo Backes

Reputation: 6289

univocity-parsers is your best bet for loading the CSV file; you probably won't be able to hand-code anything faster. Your problems most likely come from two things:

1 - loading everything into memory. That's generally a bad design decision, but if you do it, make sure to allocate enough memory for your application, for example with the flags -Xms8G and -Xmx8G.

2 - you are probably not batching your insert statements.

My suggestion is to try this (using univocity-parsers):

    //configure the input format
    CsvParserSettings settings = new CsvParserSettings();

    //get an iterator
    CsvParser parser = new CsvParser(settings);
    Iterator<String[]> it = parser.iterate(new File("/path/to/your.csv"), "UTF-8").iterator();

    //connect to the database and create an insert statement
    Connection connection = getYourDatabaseConnectionSomehow();
    final int COLUMN_COUNT = 2;
    PreparedStatement statement = connection.prepareStatement("INSERT INTO some_table(column1, column2) VALUES (?,?)");

    //run batch inserts of 1000 rows per batch
    int batchSize = 0;
    while (it.hasNext()) {
        //get the next row from the parser and set the values in your statement
        String[] row = it.next();
        for (int i = 0; i < COLUMN_COUNT; i++) {
            if (i < row.length) {
                statement.setObject(i + 1, row[i]);
            } else { //row in input is shorter than COLUMN_COUNT
                statement.setObject(i + 1, null);
            }
        }

        //add the values to the batch
        statement.addBatch();
        batchSize++;

        //once 1000 rows have been added to the batch, execute it
        if (batchSize == 1000) {
            statement.executeBatch();
            batchSize = 0;
        }
    }
    //the last batch probably won't have 1000 rows
    if (batchSize > 0) {
        statement.executeBatch();
    }

This should execute pretty quickly, and you won't need even 100 MB of memory to run it.

For the sake of clarity, I didn't use any try/catch/finally block to close any resources here. Your actual code must handle that.
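For instance, a minimal try-with-resources sketch (reusing the connection and statement from above):

    try (Connection connection = getYourDatabaseConnectionSomehow();
         PreparedStatement statement = connection.prepareStatement(
                 "INSERT INTO some_table(column1, column2) VALUES (?,?)")) {
        // ... run the batching loop shown above ...
    }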

Hope it helps.

Upvotes: 2

Jerin Joseph

Reputation: 1097

Use the Commons CSV library by Apache, which parses records lazily as you iterate instead of loading the whole file. See the sketch below.
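A minimal streaming sketch with Commons CSV (the file path and column indices are placeholders, not from the question):

    import java.io.Reader;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import org.apache.commons.csv.CSVFormat;
    import org.apache.commons.csv.CSVRecord;

    try (Reader reader = Files.newBufferedReader(Paths.get("/path/to/your.csv"))) {
        // records are parsed lazily while iterating, so the whole
        // file is never held in memory at once
        for (CSVRecord record : CSVFormat.DEFAULT.parse(reader)) {
            String column1 = record.get(0);
            String column2 = record.get(1);
            // persist the values, ideally with batched inserts as in the answer above
        }
    }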

Upvotes: 1
