doremi

Reputation: 15329

How to retry Postgres COPY after error with large CSV

I'm working on a project to make the process of ingesting a 40GB CSV into a Postgres table used in a Rails app idempotent. If the ingestion worker fails, it should be possible to re-run it, and the ingestion should pick back up where it left off, or do nothing if it previously completed successfully.

I've addressed making the retrieval of the CSV file idempotent, but the part I'm unsure about is how Postgres handles a failure during a COPY operation.

The import is routine and resembles:

COPY my_table (id, col1, col2, etc) FROM 'large.csv' WITH (DELIMITER ',')

The most I could find about failures comes from the docs:

COPY stops operation at the first error. This should not lead to problems in the event of a COPY TO, but the target table will already have received earlier rows in a COPY FROM. These rows will not be visible or accessible, but they still occupy disk space.

Is it possible a COPY failure while inserting a row would leave missing or corrupt data in some of the columns?

If the COPY command were 50% of the way through the 40GB file when it failed, what would be the most efficient way to insert the remaining rows on retry?

Upvotes: 2

Views: 850

Answers (1)

Craig Ringer

Reputation: 324771

Is it possible a COPY failure while inserting a row would leave missing or corrupt data in some of the columns?

No. The partially copied data uses some disk space, which is re-used when you add data to the table or freed when you VACUUM FULL it, but it's not visible at the SQL level in any way.
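For what it's worth, reclaiming that space is just a VACUUM FULL (which rewrites the table and holds an exclusive lock while it runs). A minimal sketch with Python and psycopg2, where the connection string and table name are placeholders:

import psycopg2

conn = psycopg2.connect("dbname=mydb")
conn.autocommit = True  # VACUUM cannot run inside a transaction block
with conn.cursor() as cur:
    cur.execute("VACUUM FULL my_table")
conn.close()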

As far as the SQL level goes, the COPY is completely undone. You cannot resume, because there is nothing to resume. The transaction rolled back, and all copied data was discarded.

You'll want to COPY in batches, using savepoints or interim commits. Or use an external loader.
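As a rough illustration of the batched approach, here is a minimal sketch assuming Python with psycopg2; the column list, batch size, and the my_table_import_progress bookkeeping table are all placeholders, not anything your original COPY implies:

import csv
import io

import psycopg2

BATCH_SIZE = 100_000
CSV_PATH = "large.csv"


def copy_batch(cur, rows):
    """COPY one batch of already-parsed rows via an in-memory CSV buffer."""
    buf = io.StringIO()
    csv.writer(buf).writerows(rows)
    buf.seek(0)
    cur.copy_expert(
        "COPY my_table (id, col1, col2) FROM STDIN WITH (FORMAT csv)", buf
    )


conn = psycopg2.connect("dbname=mydb")

# One-row bookkeeping table holding how many rows have been committed so far.
with conn, conn.cursor() as cur:
    cur.execute(
        "CREATE TABLE IF NOT EXISTS my_table_import_progress"
        " (rows_done bigint NOT NULL)"
    )
    cur.execute(
        "INSERT INTO my_table_import_progress (rows_done)"
        " SELECT 0 WHERE NOT EXISTS (SELECT 1 FROM my_table_import_progress)"
    )
    cur.execute("SELECT rows_done FROM my_table_import_progress")
    rows_done = cur.fetchone()[0]

with open(CSV_PATH, newline="") as f:
    reader = csv.reader(f)
    for _ in range(rows_done):  # skip rows a previous run already committed
        next(reader)

    batch = []
    for row in reader:
        batch.append(row)
        if len(batch) == BATCH_SIZE:
            # Each batch and its progress update commit (or roll back) together.
            with conn, conn.cursor() as cur:
                copy_batch(cur, batch)
                rows_done += len(batch)
                cur.execute(
                    "UPDATE my_table_import_progress SET rows_done = %s",
                    (rows_done,),
                )
            batch = []

    if batch:  # final partial batch
        with conn, conn.cursor() as cur:
            copy_batch(cur, batch)
            cur.execute(
                "UPDATE my_table_import_progress SET rows_done = %s",
                (rows_done + len(batch),),
            )

conn.close()

Because each batch's COPY and its progress update are in one transaction, a failed run simply resumes from the last committed batch on the next attempt. An external loader such as pgloader gives you similar batching and error handling without hand-rolling it.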

Upvotes: 1
