Reputation: 77
My application currently uses CSV Parser to parse CSV files and persist the records to a database. It loads the entire CSV into memory, takes a long time to persist, and sometimes even times out. I have seen mixed recommendations on this site about using the univocity parser instead. Please advise on the best approach to process large amounts of data in less time.
Thank you.
Code:
int numRecords = csvParser.parse( fileBytes );

public int parse(InputStream ins) throws ParserException {
    long parseTime = System.currentTimeMillis();
    fireParsingBegin();
    ParserEngine engine = null;
    try {
        engine = (ParserEngine) getEngineClass().newInstance();
    } catch (Exception e) {
        throw new ParserException(e.getMessage());
    }
    engine.setInputStream(ins);
    engine.start();
    int count = parse(engine);
    fireParsingDone();
    long seconds = (System.currentTimeMillis() - parseTime) / 1000;
    System.out.println("Time taken is " + seconds);
    return count;
}
protected int parse(ParserEngine engine) throws ParserException {
    int count = 0;
    while (engine.next()) { // valuesStringArr in the engine is populated with cell data
        if (stopParsing) {
            break;
        }
        Object o = parseObject(engine); // create the individual TOs
        if (o != null) {
            count++; // count is increased after every TO is formed
            fireObjectParsed(o, engine); // put it into BO/Col and so on for validation preparations
        } else {
            return count;
        }
    }
    return count;
}
Upvotes: 6
Views: 11657
Reputation: 1616
Streaming with Apache Commons IO
try (LineIterator it = FileUtils.lineIterator(theFile, "UTF-8")) {
    while (it.hasNext()) {
        String line = it.nextLine();
        // do something with the line
    }
}
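One caveat: LineIterator streams raw lines, but it is not a CSV parser, so splitting on commas breaks as soon as a quoted field contains a comma. A minimal sketch of per-line handling, assuming simple unquoted CSV and a hypothetical persistRow() helper:

try (LineIterator it = FileUtils.lineIterator(theFile, "UTF-8")) {
    while (it.hasNext()) {
        // WARNING: String.split is only safe when no field is quoted
        String[] fields = it.nextLine().split(",", -1);
        persistRow(fields); // hypothetical per-row persistence hook
    }
}

For anything beyond trivial input, pair this streaming approach with a real CSV parser.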
Upvotes: 0
Reputation: 6289
univocity-parsers is your best bet for loading the CSV file; you probably won't be able to hand-code anything faster. Your problems likely come from two things:
1 - loading everything into memory. That's generally a bad design decision, but if you do it, make sure to allocate enough memory for your application, for example by starting the JVM with the flags -Xms8G and -Xmx8G (a quick way to verify the setting is sketched after this list).
2 - you are probably not batching your insert statements.
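As a quick sanity check (not part of the original advice), you can print the maximum heap the JVM actually received, to confirm the -Xmx flag took effect:

long maxHeapMb = Runtime.getRuntime().maxMemory() / (1024 * 1024);
System.out.println("Max heap: " + maxHeapMb + " MB"); // reports roughly 8192 with -Xmx8G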
My suggestion is to try this (using univocity-parsers):

// configure the input format
CsvParserSettings settings = new CsvParserSettings();

// get a row iterator
CsvParser parser = new CsvParser(settings);
Iterator<String[]> it = parser.iterate(new File("/path/to/your.csv"), "UTF-8").iterator();

// connect to the database and create an insert statement
Connection connection = getYourDatabaseConnectionSomehow();
final int COLUMN_COUNT = 2;
PreparedStatement statement = connection.prepareStatement("INSERT INTO some_table(column1, column2) VALUES (?,?)");

// run batch inserts of 1000 rows per batch
int batchSize = 0;
while (it.hasNext()) {
    // get the next row from the parser and set the values in your statement
    String[] row = it.next();
    for (int i = 0; i < COLUMN_COUNT; i++) {
        if (i < row.length) {
            statement.setObject(i + 1, row[i]);
        } else { // row in the input is shorter than COLUMN_COUNT
            statement.setObject(i + 1, null);
        }
    }

    // add the values to the batch
    statement.addBatch();
    batchSize++;

    // once 1000 rows have made it into the batch, execute it
    if (batchSize == 1000) {
        statement.executeBatch();
        batchSize = 0;
    }
}

// the last batch probably won't have exactly 1000 rows
if (batchSize > 0) {
    statement.executeBatch();
}
This should execute pretty quickly, and you won't need even 100 MB of memory to run it.
For the sake of clarity, I didn't use any try/catch/finally block to close any resources here. Your actual code must handle that.
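For example, a minimal sketch of the same flow using try-with-resources (keeping the hypothetical getYourDatabaseConnectionSomehow() placeholder), so the statement and connection are closed even if an insert fails:

CsvParserSettings settings = new CsvParserSettings();
CsvParser parser = new CsvParser(settings);

try (Connection connection = getYourDatabaseConnectionSomehow();
     PreparedStatement statement = connection.prepareStatement(
             "INSERT INTO some_table(column1, column2) VALUES (?,?)")) {
    for (String[] row : parser.iterate(new File("/path/to/your.csv"), "UTF-8")) {
        // ... set the parameters, addBatch() and executeBatch() as shown above ...
    }
}

Depending on your driver, batching may also benefit from driver-level settings; for example, MySQL Connector/J only rewrites batches into multi-row inserts when rewriteBatchedStatements=true is set on the JDBC URL.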
Hope it helps.
Upvotes: 2