Superdooperhero

Reputation: 8096

How to ignore errors with psql \copy meta-command

I am using psql with a PostgreSQL database and the following copy command:

\COPY isa (np1, np2, sentence) FROM 'c:\Downloads\isa.txt' WITH DELIMITER '|'

I get:

ERROR:  extra data after last expected column

How can I skip the lines with errors?

Upvotes: 17

Views: 51512

Answers (4)

Erwin Brandstetter

Reputation: 658082

\copy is just a wrapper around SQL COPY that channels results through psql.

Postgres 17

ON_ERROR ignore now allows skipping error rows. The changes are summarized in the release notes for Postgres 17.

The Postgres 17 manual on COPY:

By default, COPY will fail if it encounters an error during processing. For use cases where a best-effort attempt at loading the entire file is desired, the ON_ERROR clause can be used to specify some other behavior.

And:

ON_ERROR

Specifies how to behave when encountering an error converting a column's input value into its data type. An error_action value of stop means fail the command, while ignore means discard the input row and continue with the next one. The default is stop.

The ignore option is applicable only for COPY FROM when the FORMAT is text or csv.

A NOTICE message containing the ignored row count is emitted at the end of the COPY FROM if at least one row was discarded. When LOG_VERBOSITY option is set to verbose, a NOTICE message containing the line of the input file and the column name whose input conversion has failed is emitted for each discarded row.

To run \copy in psql (on your Windows machine), discarding error rows:

\copy isa (np1, np2, sentence) FROM 'c:\Downloads\isa.txt' (DELIMITER '|', ON_ERROR ignore);
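
To additionally get a NOTICE naming the failing input line and column for every skipped row (per the LOG_VERBOSITY description quoted above), the options can be combined; a sketch:

\copy isa (np1, np2, sentence) FROM 'c:\Downloads\isa.txt' (DELIMITER '|', ON_ERROR ignore, LOG_VERBOSITY verbose);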

Postgres 16 or older

You cannot skip the errors without skipping the whole command. There is no more sophisticated error handling.

The Postgres 16 manual for COPY:

COPY stops operation at the first error. This should not lead to problems in the event of a COPY TO, but the target table will already have received earlier rows in a COPY FROM. These rows will not be visible or accessible, but they still occupy disk space. This might amount to a considerable amount of wasted disk space if the failure happened well into a large copy operation. You might wish to invoke VACUUM to recover the wasted space.

And:

COPY FROM will raise an error if any line of the input file contains more or fewer columns than are expected.

There was an attempt to add error logging to COPY in Postgres 9.0 but it was never committed.

Solution

Fix your input file instead.

If you have one or more additional columns in your input file and the file is otherwise consistent, you might add dummy columns to your table isa and drop those afterwards. Or (cleaner with production tables) import to a temporary staging table and INSERT selected columns (or expressions) into your target table isa from there.
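
A minimal sketch of the staging-table route, assuming every input line consistently carries exactly one extra field (column names and types are illustrative guesses):

CREATE TEMP TABLE isa_staging (np1 text, np2 text, sentence text, extra text);
\copy isa_staging FROM 'c:\Downloads\isa.txt' WITH DELIMITER '|'
INSERT INTO isa (np1, np2, sentence)
SELECT np1, np2, sentence
FROM   isa_staging;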


Upvotes: 24

Nulik

Reputation: 7380

It is too bad that in 25 years Postgres hasn't gained an --ignore-errors flag or option for the COPY command. In this era of big data you get a lot of dirty records, and it can be very costly for a project to fix every outlier.

I had to work around it this way:

  1. Copy the original table and call it dummy_original_table.
  2. On the dummy table, create a trigger function like this (the CREATE TRIGGER statement that attaches it is sketched after the copy command below):
    CREATE OR REPLACE FUNCTION on_insert_in_original_table() RETURNS trigger AS $$
    DECLARE
        v_rec   RECORD;
    BEGIN
        -- the trigger prevents a 'duplicate key' error by returning NULL on duplicates
        SELECT * FROM original_table WHERE primary_key = NEW.primary_key INTO v_rec;
        IF v_rec IS NOT NULL THEN
            RETURN NULL;
        END IF;
        BEGIN
            INSERT INTO original_table(datum, primary_key) VALUES (NEW.datum, NEW.primary_key)
                ON CONFLICT DO NOTHING;
        EXCEPTION
            WHEN OTHERS THEN
                NULL;  -- swallow any other per-row error
        END;
        -- returning NULL keeps the row out of the dummy table itself
        RETURN NULL;
    END;
    $$ LANGUAGE plpgsql;
  3. Run a copy into the dummy table. No record will be inserted there, but all of them will be inserted into original_table:

psql dbname -c "\copy dummy_original_table(datum,primary_key) FROM '/home/user/data.csv' delimiter E'\t'"
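
Note that the answer shows only the trigger function; it still has to be attached to the dummy table with a CREATE TRIGGER statement, roughly like this (a sketch assuming Postgres 11+ and the names used above; the trigger name is illustrative):

CREATE TRIGGER redirect_to_original_table
BEFORE INSERT ON dummy_original_table
FOR EACH ROW EXECUTE FUNCTION on_insert_in_original_table();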

Upvotes: 6

Rob Bednark

Reputation: 28212

Here's one solution: import the input file one line at a time. Performance will be much slower, but it may be sufficient for your scenario:

#!/bin/bash

input_file=./my_input.csv
tmp_file=/tmp/one-line.csv

# Feed one line at a time so a bad line only fails its own single-row COPY.
while read -r input_line; do
    echo "$input_line" > "$tmp_file"
    psql my_database \
     -c "\
     COPY my_table \
     FROM '$tmp_file' \
     DELIMITER '|' \
     CSV;\
    "
done < "$input_file"

Additionally, you could modify the script to capture psql's stdout/stderr and exit status, and if the exit status is non-zero, echo $input_line and the captured output to stdout and/or append them to a file.
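
A rough sketch of that variant, replacing the psql call inside the loop (the log file name is illustrative):

    # Capture psql's combined output; on failure, report and log the offending line.
    if ! output=$(psql my_database -c "COPY my_table FROM '$tmp_file' DELIMITER '|' CSV;" 2>&1); then
        echo "failed line: $input_line"
        echo "$output"
        printf '%s\n' "$input_line" >> rejected_lines.csv
    fi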

Upvotes: 2

Rob Bednark

Reputation: 28212

Workaround: remove the reported errant line using sed and run \copy again

Later versions of Postgres (including Postgres 13) will report the line number of the error. You can then remove that line with sed and run \copy again, e.g.:

#!/bin/bash
bad_line_number=5  # assuming line 5 is the bad line
sed "${bad_line_number}d" < input.csv > filtered.csv
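
Then run \copy again against the filtered file, adapting the asker's original command, e.g.:

\copy isa (np1, np2, sentence) FROM 'filtered.csv' WITH DELIMITER '|'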

[per the comment from @Botond_Balázs]

Upvotes: 1
