Sohail Shaikh

Reputation: 11

Error: ON CONFLICT DO UPDATE command cannot affect row a second time

I have connected the Express JavaScript framework to PostgreSQL.
I am building a small file upload project.
I have 100,000 rows of CSV data to upload, and it has to be fast.
I have to ensure that user_email is unique and keep the most recent value during insertion.

I tried two approaches.

Approach 1:

let q = "INSERT INTO employee (user_name, user_email, age, address) VALUES ($1, $2, $3, $4) ON CONFLICT (user_email) DO UPDATE SET user_name = EXCLUDED.user_name, age = EXCLUDED.age, address = EXCLUDED.address;";

values.forEach(element => {
  client.query(
    q,
    [element[0], element[1], element[2], element[3]],
    (err, result) => {
      if (err == null) {
        console.log(result.rowCount + " rows inserted successfully");
        values = [];
      } else {
        console.log(err);
      }
    }
  );
});

This code iterates over each row and works fine, but the problem is that it takes 2-3 minutes to insert all the data.

Approach 2:

client.query(
  format("INSERT INTO employee VALUES %L ON CONFLICT (user_email) DO UPDATE SET user_name = EXCLUDED.user_name, age = EXCLUDED.age, address = EXCLUDED.address;", values),
  [],
  (err, result) => {
    if (err == null) {
      console.log(result.rowCount + " rows inserted successfully");
      values = [];
    } else {
      console.log(err);
    }
  }
);

Sample data:

values=  [[
    'Joyce West',
    '[email protected]',
    '99',
    '989 Joshua Lake\nNew Amber, CO 85650'
  ],
  [
    'Jon Santana',
    '[email protected]',
    '41',
    '9974 Burns Turnpike\nPort Donald, AK 79911'
  ]]

parsedData=[ {
    name: 'Tara Miller',
    email: '[email protected]',
    age: '49',
    address: 'Unit 6011 Box 5872\nDPO AE 90613'
  },
  {
    name: 'Derrick Whitney',
    email: '[email protected]',
    age: '55',
    address: '070 Gerald Manor Suite 322\nEast Sarah, NJ 09879'
  }]

In approach two I insert all the data in a single query, which is extremely fast. But the problem is that it fails with this error:

ON CONFLICT DO UPDATE command cannot affect row a second time

I am looking for something like "on duplicate key update" in PostgreSQL.

Please suggest something that maintains the upload speed and also solves the problem with the duplicates.

Upvotes: 1

Views: 933

Answers (1)

Erwin Brandstetter

Reputation: 658472

I am looking for something like "on duplicate key update" in PostgreSQL.

INSERT ... ON CONFLICT ... DO UPDATE works fine in Postgres to resolve conflicts between input rows and existing rows. But it cannot resolve duplicates on key columns within input rows. You need to resolve that before you hand the set of input rows to INSERT.
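To illustrate what "duplicates within input rows" means: if two input rows share the same user_email, a single multi-row upsert would have to update the same target row twice, and Postgres rejects that. A minimal sketch in the question's setup (the rows here are made up; format is pg-format and client is the connected pg client, as in approach 2):

// Two hypothetical input rows with the same user_email in one statement.
const duplicateInput = [
  ['Ann Example', 'ann@example.com', '30', '1 First St'],
  ['Ann Example', 'ann@example.com', '31', '2 Second St']   // same email again
];

client.query(
  format("INSERT INTO employee (user_name, user_email, age, address) VALUES %L ON CONFLICT (user_email) DO UPDATE SET user_name = EXCLUDED.user_name, age = EXCLUDED.age, address = EXCLUDED.address;", duplicateInput),
  (err) => {
    // err.message: "ON CONFLICT DO UPDATE command cannot affect row a second time"
    if (err) console.log(err.message);
  }
);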

Resolve conflict in input set

WITH dist_set AS (
   SELECT DISTINCT ON (user_email) *
   FROM  (
      VALUES ...                                 -- your input here !
      ) t (user_name, user_email, age, address)  -- matching column list !
-- ORDER  BY ???                                 -- ① !!!
   )
INSERT INTO employee (user_name, user_email, age, address)  -- spell out target list !
SELECT *
FROM   dist_set
ON     CONFLICT (user_email) DO UPDATE
SET    user_name = EXCLUDED.user_name
     , age = EXCLUDED.age
     , address = EXCLUDED.address;

① Declare a deterministic sort order to break ties in a consistent fashion. Without ORDER BY you get an arbitrary pick (the first row in the input list in simple cases), but no guarantees are made.

Notably, use DISTINCT ON (), not just DISTINCT. The latter would only remove duplicates on the whole row; we must reduce to a single row per (set of) key column(s).
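The same deduplication can also be done on the client side before building the multi-row statement. A minimal sketch for the Node.js setup in the question, assuming values is the array of [user_name, user_email, age, address] rows and format is pg-format, as in approach 2. Iterating in input order and overwriting earlier entries keeps the last row per email, which matches the asker's "most recent value" requirement and mirrors the deterministic tie-breaking above:

// Keep only one row per user_email; a later duplicate overwrites an
// earlier one, so the last occurrence in the input wins.
const byEmail = new Map();
for (const row of values) {
  byEmail.set(row[1], row);  // row[1] is user_email
}
const dedupedValues = [...byEmail.values()];

client.query(
  format("INSERT INTO employee (user_name, user_email, age, address) VALUES %L ON CONFLICT (user_email) DO UPDATE SET user_name = EXCLUDED.user_name, age = EXCLUDED.age, address = EXCLUDED.address;", dedupedValues),
  (err, result) => {
    if (err) {
      console.log(err);
    } else {
      console.log(result.rowCount + " rows upserted");
    }
  }
);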


Large input

Once the input set is large enough, it pays to use a temporary table as a stepping stone: insert data with a fast COPY command, create an appropriate index to help deduplication, and UPSERT from there:

-- SET temp_buffers = '100MB';  -- big enough to hold tbl & idx !

CREATE TEMP TABLE tmp_emp (
  user_name text
, user_email text
, age int
, address  text
);

COPY tmp_emp FROM '/absolute/path/to/file' (FORMAT csv);

CREATE INDEX ON tmp_emp (user_email);
   
INSERT INTO employee (user_name, user_email, age, address)  -- spell out target list !
SELECT DISTINCT ON (user_email) *
FROM   tmp_emp
ON     CONFLICT ...

-- DROP TABLE pg_temp.tmp_emp;  -- optional
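Driven from Node.js, that sequence could look roughly like the sketch below. It assumes the pg-copy-streams package for COPY ... FROM STDIN (not part of the question's setup) and a CSV file whose columns match tmp_emp; treat it as an outline rather than drop-in code:

const fs = require('fs');
const { pipeline } = require('stream/promises');
const copyFrom = require('pg-copy-streams').from;  // assumed extra dependency

async function bulkUpsert(client, csvPath) {
  await client.query(
    'CREATE TEMP TABLE tmp_emp (user_name text, user_email text, age int, address text)'
  );

  // Stream the CSV file straight into the temp table with COPY.
  const ingest = client.query(copyFrom('COPY tmp_emp FROM STDIN (FORMAT csv)'));
  await pipeline(fs.createReadStream(csvPath), ingest);

  await client.query('CREATE INDEX ON tmp_emp (user_email)');

  // Deduplicate on user_email, then upsert into the target table.
  const { rowCount } = await client.query(`
    INSERT INTO employee (user_name, user_email, age, address)
    SELECT DISTINCT ON (user_email) *
    FROM   tmp_emp
    ON     CONFLICT (user_email) DO UPDATE
    SET    user_name = EXCLUDED.user_name
         , age = EXCLUDED.age
         , address = EXCLUDED.address`);
  console.log(rowCount + ' rows upserted');

  await client.query('DROP TABLE tmp_emp');  // optional; dropped at session end anyway
}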

Adapt the index to your actual deduplication logic. Depending on data distribution, there are smarter queries for large sets to optimize performance.

If you cannot use COPY, consider psql's \copy, which is more easily accessible.

Aside: A column "age" is almost always a design error, and should be something like "birthday" instead.

Upvotes: 3
