What is the most efficient way to persist thousands of entities?

Question

I have fairly large CSV files which I need to parse and then persist into PostgreSQL. For example, one file contains 2_070_000 records which I was able to parse and persist in ~8 minutes (single thread). Is it possible to persist them using multiple threads?

    public void importCsv(MultipartFile csvFile, Class targetClass) {
        final var headerMapping = getHeaderMapping(targetClass);
        File tempFile = null;

        try {
            final var randomUuid = UUID.randomUUID().toString();
            tempFile = File.createTempFile("data-" + randomUuid, ".csv");
            csvFile.transferTo(tempFile);

            final var csvFileName = csvFile.getOriginalFilename();
            final var csvReader = new BufferedReader(new FileReader(tempFile, StandardCharsets.UTF_8));

            Stopwatch stopWatch = Stopwatch.createStarted();
            log.info("Starting to import {}", csvFileName);
            final var csvRecords = CSVFormat.DEFAULT
                    .withDelimiter(';')
                    .withHeader(headerMapping.keySet().toArray(String[]::new))
                    .withSkipHeaderRecord(true)
                    .parse(csvReader);

            final var models = StreamSupport.stream(csvRecords.spliterator(), true)
                    .map(record -> parseRecord(record, headerMapping, targetClass))
                    .collect(Collectors.toUnmodifiableList());

            // How to save such a large list?

            log.info("Finished import of {} in {}", csvFileName, stopWatch);
        } catch (IOException ex) {
            ex.printStackTrace();
        } finally {
            // tempFile is still null if createTempFile itself failed
            if (tempFile != null) {
                tempFile.delete();
            }
        }
    }

models contains a lot of records. The parsing is done with a parallel stream, so it's quite fast. I'm hesitant to call SimpleJpaRepository.saveAll because I'm not sure what it does under the hood.

The question is: What is the most efficient way to persist such a large list of entities?

P.S.: Any other improvements are greatly appreciated.

v.ladynev · Accepted Answer

You have to use batch inserts.

Create an interface for a custom repository, SomeRepositoryCustom:

    public interface SomeRepositoryCustom {

        void batchSave(List<Record> records);

    }

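Create an implementation of SomeRepositoryCustom:

    import java.util.List;

    import org.springframework.beans.factory.annotation.Autowired;
    import org.springframework.jdbc.core.JdbcTemplate;
    import org.springframework.jdbc.core.PreparedStatementCallback;
    import org.springframework.stereotype.Repository;

    @Repository
    class SomeRepositoryCustomImpl implements SomeRepositoryCustom {

        private final JdbcTemplate template;

        @Autowired
        public SomeRepositoryCustomImpl(JdbcTemplate template) {
            this.template = template;
        }

        @Override
        public void batchSave(List<Record> records) {
            final String sql = "INSERT INTO RECORDS(column_a, column_b) VALUES (?, ?)";

            template.execute(sql, (PreparedStatementCallback<Void>) ps -> {
                // Queue one parameter set per record on the same prepared statement
                for (Record record : records) {
                    ps.setString(1, record.getA());
                    ps.setString(2, record.getB());
                    ps.addBatch();
                }
                // Hand all queued inserts to the driver in one call
                ps.executeBatch();
                return null;
            });
        }

    }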

Extend your JpaRepository with SomeRepositoryCustom:

    @Repository
    public interface SomeRepository extends JpaRepository<Record, Long>, SomeRepositoryCustom {
        // Long is an assumed ID type here; use your entity's actual ID type
    }

To save:

    someRepository.batchSave(records);

Notes

Keep in mind that even if your code uses batch inserts, the database driver may not actually send them as batches. For MySQL, for example, you need to add the parameter rewriteBatchedStatements=true to the database URL. So it is better to enable SQL logging in the driver (not in Hibernate) to verify what is really sent. Debugging the driver code can also be useful.
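
Since the question targets PostgreSQL: the PostgreSQL JDBC driver has a comparable connection property, reWriteBatchedInserts=true, which lets the driver rewrite batched INSERTs into multi-row statements. Below is a minimal sketch of appending such a flag to a JDBC URL. The helper JdbcUrls.withParam and the host and database names in the usage are hypothetical, and the property name should be checked against the driver documentation for your version.

```java
final class JdbcUrls {

    private JdbcUrls() {
    }

    // Appends name=value to a JDBC URL, using '?' for the first parameter
    // and '&' for any further one. No URL encoding is done (sketch only).
    static String withParam(String jdbcUrl, String name, String value) {
        String separator = jdbcUrl.contains("?") ? "&" : "?";
        return jdbcUrl + separator + name + "=" + value;
    }

    public static void main(String[] args) {
        // Hypothetical host and database name
        String url = withParam("jdbc:postgresql://localhost:5432/mydb",
                "reWriteBatchedInserts", "true");
        System.out.println(url);
        // prints jdbc:postgresql://localhost:5432/mydb?reWriteBatchedInserts=true
    }
}
```

In a Spring Boot application the resulting URL would typically go into spring.datasource.url rather than being built in code.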

You will need to decide whether to split the records into packets inside the loop, executing the batch after each packet:

    for (Record record : records) {

    }

A driver may do the splitting for you, in which case you don't need to. But it is better to debug this as well.
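
If you decide to split manually, one way to sketch it is a small helper that cuts the list into fixed-size chunks, so that each chunk becomes one JDBC batch. BatchChunks and partition are hypothetical names, not part of Spring or JDBC, and the right chunk size depends on your driver and data.

```java
import java.util.ArrayList;
import java.util.List;

final class BatchChunks {

    private BatchChunks() {
    }

    // Splits items into consecutive chunks of at most chunkSize elements.
    // The last chunk may be smaller than chunkSize.
    static <T> List<List<T>> partition(List<T> items, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<List<T>> chunks = new ArrayList<>();
        for (int from = 0; from < items.size(); from += chunkSize) {
            int to = Math.min(from + chunkSize, items.size());
            // subList returns a view; copy it so each chunk is independent
            chunks.add(new ArrayList<>(items.subList(from, to)));
        }
        return chunks;
    }
}
```

With this, BatchChunks.partition(records, 1_000).forEach(someRepository::batchSave) would execute one batch per chunk. The value 1_000 is only an arbitrary starting point, not a tuned number.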

P.S. Don't use var everywhere.
