Reputation: 15329
I'm working on a project to make the process of ingesting a 40GB csv into a Postgres table used in a Rails app idempotent. If the ingestion worker were to fail, the worker should be re-run and the ingestion would pick back up where it left off or do nothing if it previously completed successfully.
I've addressed making the retrieval of the CSV file idempotent, but the part I'm unsure about is how Postgres handles a failure during a COPY operation.
The import is routine and resembles:
copy my_table (id, col1, col2, etc) from 'large.csv' with (DELIMITER ',')
The most I could find about failures comes from the doc:
COPY stops operation at the first error. This should not lead to problems in the event of a COPY TO, but the target table will already have received earlier rows in a COPY FROM. These rows will not be visible or accessible, but they still occupy disk space.
Is it possible that a COPY failure while inserting a row would leave missing or corrupt data in some of the columns?
If the COPY command fails 50% of the way through the 40GB file, what would be the most efficient way to insert the remaining rows on retry?
Upvotes: 2
Views: 850
Reputation: 324771
Is it possible a COPY failure while inserting a row would leave missing or corrupt data in some of the columns?
No. It uses some disk space, which is re-used when you add data to the table or freed when you VACUUM FULL the table, but it's not visible at the SQL level in any way.
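For instance, you can see roughly how much space those dead rows occupy and then reclaim it (my_table matches the example in the question):

-- Total on-disk size of the table, including rows left over from the aborted COPY
SELECT pg_size_pretty(pg_total_relation_size('my_table'));

-- Rewrite the table to return that space to the OS; takes an exclusive lock while it runs
VACUUM FULL my_table;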
As far as the SQL level goes, the COPY is completely undone. You cannot resume, because there is nothing to resume. The transaction rolled back, and all copied data was discarded.
You'll want to COPY in batches, using savepoints or interim commits, or use an external loader.
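A minimal sketch of the batching idea, assuming the 40GB file has been pre-split into chunks (e.g. with split -l) and using a hypothetical ingest_progress table and placeholder file paths; the tracking row is what lets a retried worker skip chunks that already committed:

-- Hypothetical bookkeeping table: one row per chunk that has been fully loaded
CREATE TABLE IF NOT EXISTS ingest_progress (
    chunk_name text PRIMARY KEY,
    loaded_at  timestamptz NOT NULL DEFAULT now()
);

-- For each chunk not yet listed in ingest_progress, load it and record it
-- in the same transaction, so a failure rolls back both together.
-- (Column list abbreviated as in the question.)
BEGIN;
COPY my_table (id, col1, col2) FROM '/data/large.csv.part-000' WITH (DELIMITER ',');
INSERT INTO ingest_progress (chunk_name) VALUES ('large.csv.part-000');
COMMIT;

Committing one chunk at a time gives up the all-or-nothing behaviour of a single COPY, but the bookkeeping table is what makes the retry idempotent: a re-run only loads chunks that have no row in ingest_progress. The same pattern works from any client driver, or via psql's \copy if the file lives on the app server rather than the database server.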
Upvotes: 1