Reputation: 469
I have fairly large CSV files which I need to parse and then persist into PostgreSQL. For example, one file contains 2_070_000 records which I was able to parse and persist in ~8 minutes (single thread). Is it possible to persist them using multiple threads?
public void importCsv(MultipartFile csvFile, Class<T> targetClass) {
final var headerMapping = getHeaderMapping(targetClass);
File tempFile = null;
try {
final var randomUuid = UUID.randomUUID().toString();
tempFile = File.createTempFile("data-" + randomUuid, "csv");
csvFile.transferTo(tempFile);
final var csvFileName = csvFile.getOriginalFilename();
final var csvReader = new BufferedReader(new FileReader(tempFile, StandardCharsets.UTF_8));
Stopwatch stopWatch = Stopwatch.createStarted();
log.info("Starting to import {}", csvFileName);
final var csvRecords = CSVFormat.DEFAULT
.withDelimiter(';')
.withHeader(headerMapping.keySet().toArray(String[]::new))
.withSkipHeaderRecord(true)
.parse(csvReader);
final var models = StreamSupport.stream(csvRecords.spliterator(), true)
.map(record -> parseRecord(record, headerMapping, targetClass))
.collect(Collectors.toUnmodifiableList());
// How to save such a large list?
log.info("Finished import of {} in {}", csvFileName, stopWatch);
} catch (IOException ex) {
ex.printStackTrace();
} finally {
tempFile.delete();
}
}
models contains a lot of records. The parsing into records is done using parallel stream, so it's quite fast. I'm afraid to call SimpleJpaRepository.saveAll, because I'm not sure what it will do under the hood.
The question is: What is the most efficient way to persist such a large list of entities?
P.S.: Any other improvements are greatly appreciated.
Upvotes: 2
Views: 1916
Reputation: 19956
You have to use batch inserts.
SomeRepositoryCustom
public interface SomeRepositoryCustom {
void batchSave(List<Record> records);
}
SomeRepositoryCustom
@Repository
class SomesRepositoryCustomImpl implements SomeRepositoryCustom {
private JdbcTemplate template;
@Autowired
public SomesRepositoryCustomImpl(JdbcTemplate template) {
this.template = template;
}
@Override
public void batchSave(List<Record> records) {
final String sql = "INSERT INTO RECORDS(column_a, column_b) VALUES (?, ?)";
template.execute(sql, (PreparedStatementCallback<Void>) ps -> {
for (Record record : records) {
ps.setString(1, record.getA());
ps.setString(2, record.getB());
ps.addBatch();
}
ps.executeBatch();
return null;
});
}
}
JpaRepository
with SomeRepositoryCustom
@Repository
public interface SomeRepository extends JpaRepository, SomeRepositoryCustom {
}
to save
someRepository.batchSave(records);
Notes
Keep in mind that, if you are even using batch inserts, database driver will not use them. For example, for MySQL, it is necessary to add a parameter rewriteBatchedStatements=true
to database URL.
So better to enable driver SQL logging (not Hibernate) to verify everything. Also can be useful to debug driver code.
You will need to make decision about splitting records by packets in the loop
for (Record record : records) {
}
A driver can do it for you, so you will not need it. But better to debug this thing too.
P. S. Don't use var
everywhere.
Upvotes: 1