Reputation: 11
I have connected the Express JavaScript framework to PostgreSQL.
I am building a small file-upload project.
I have 100,000 rows of CSV data to upload, and it has to be fast.
I have to ensure that user_email is unique and keep the most recent value during insertion.
I tried two approaches.
Approach 1:
let q = "INSERT INTO employee (user_name, user_email, age, address) VALUES ($1, $2, $3, $4) ON CONFLICT (user_email) DO UPDATE SET user_name = EXCLUDED.user_name, age = EXCLUDED.age, address = EXCLUDED.address;";
values.forEach(element => {
  client.query(
    q,
    [element[0], element[1], element[2], element[3]],
    (err, result) => {
      if (err == null) {
        console.log(result.rowCount + " rows inserted successfully");
        values = [];
      } else {
        console.log(err);
      }
    }
  );
});
This code iterates over each row and works fine, but the problem is that it takes 2-3 minutes to insert all the data.
Approach 2:
client.query(
  format("INSERT INTO employee VALUES %L ON CONFLICT (user_email) DO UPDATE SET user_name = EXCLUDED.user_name, age = EXCLUDED.age, address = EXCLUDED.address;", values),
  [],
  (err, result) => {
    if (err == null) {
      console.log(result.rowCount + " rows inserted successfully");
      values = [];
    } else {
      console.log(err);
    }
  }
);
Sample data:
values = [
  [
    'Joyce West',
    '[email protected]',
    '99',
    '989 Joshua Lake\nNew Amber, CO 85650'
  ],
  [
    'Jon Santana',
    '[email protected]',
    '41',
    '9974 Burns Turnpike\nPort Donald, AK 79911'
  ]
]
parsedData = [
  {
    name: 'Tara Miller',
    email: '[email protected]',
    age: '49',
    address: 'Unit 6011 Box 5872\nDPO AE 90613'
  },
  {
    name: 'Derrick Whitney',
    email: '[email protected]',
    age: '55',
    address: '070 Gerald Manor Suite 322\nEast Sarah, NJ 09879'
  }
]
In approach two I insert all the data in a single query, and it is extremely fast. But the problem is that it fails with this error:
ON CONFLICT DO UPDATE command cannot affect row a second time
I am looking for something like MySQL's "on duplicate key update" in PostgreSQL.
Please suggest a way to keep the upload speed while solving the problem with the duplicates.
Upvotes: 1
Views: 933
Reputation: 658472
I am looking for something like "on duplicate key update" in PostgreSQL.
INSERT ... ON CONFLICT ... DO UPDATE
works fine in Postgres to resolve conflicts between input rows and existing rows. But it cannot resolve duplicates on key columns within the input rows themselves. You need to resolve those before you hand the set of input rows to INSERT.
WITH dist_set AS (
SELECT DISTINCT ON (user_email) *
FROM (
VALUES ... -- your input here !
) t (user_name, user_email, age, address) -- matching column list !
-- ORDER BY ??? -- ① !!!
)
INSERT INTO employee (user_name, user_email, age, address) -- spell out target list !
SELECT *
FROM dist_set
ON CONFLICT (user_email) DO UPDATE
SET user_name = EXCLUDED.user_name
, age = EXCLUDED.age
, address = EXCLUDED.address;
① Declare a deterministic sort order to break ties in consistent fashion. Without ORDER BY, you get an arbitrary pick: the first row in the input list in simple cases. But no guarantees are made without ORDER BY.
Notably, use DISTINCT ON (), not just DISTINCT. The latter would only remove duplicates on the whole row; we must reduce to a single row per (set of) key column(s).
See:
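The same deduplication can also be done on the Node side before the query string is built, which keeps the single fast multi-row INSERT. A minimal sketch, assuming `values` holds `[user_name, user_email, age, address]` arrays as in the question (the function name and sample rows here are made up for illustration); since a Map overwrites on repeated keys, the last occurrence per email wins, matching "keep the most recent value":

```javascript
// Reduce input rows to one row per user_email before handing them
// to INSERT ... ON CONFLICT. Rows are [user_name, user_email, age,
// address]; later rows overwrite earlier ones, so the last wins.
function dedupeByEmail(rows) {
  const byEmail = new Map();
  for (const row of rows) {
    byEmail.set(row[1], row); // key on user_email (index 1)
  }
  return [...byEmail.values()];
}

// Hypothetical sample data with one duplicated email:
const rows = [
  ['Joyce West',  'joyce@example.org', '99', '989 Joshua Lake'],
  ['J. West',     'joyce@example.org', '41', '9974 Burns Turnpike'],
  ['Jon Santana', 'jon@example.org',   '41', 'Port Donald, AK']
];
console.log(dedupeByEmail(rows).length); // 2
```

The deduplicated array can then be passed to pg-format's %L exactly as in approach two, and the ON CONFLICT clause only has to deal with rows that already exist in the table.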
Once the input set is large enough, it pays to use a temporary table as a stepping stone: insert the data with a fast COPY command, create an appropriate index to help deduplication, and UPSERT from there:
-- SET temp_buffers = '100MB'; -- big enough to hold tbl & idx !
CREATE TEMP TABLE tmp_emp (
user_name text
, user_email text
, age int
, address text
);
COPY tmp_emp FROM '/absolute/path/to/file' (FORMAT csv);
CREATE INDEX ON tmp_emp (user_email);
INSERT INTO employee (user_name, user_email, age, address) -- spell out target list !
SELECT DISTINCT ON (user_email) *
FROM tmp_emp
ON CONFLICT ...
-- DROP TABLE pg_temp.tmp_emp; -- optional
Adapt the index to your actual deduplication logic. Depending on data distribution, there are smart queries for large sets to optimize performance. See:
If you cannot use COPY, consider psql's \copy, which is more easily accessible. See:
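For instance (file name invented for illustration; run this in the same psql session that created tmp_emp, since temp tables are only visible to their own session, and note that \copy reads the file on the client machine, so it needs no server-side file access):

```sql
-- inside psql, in the session that created tmp_emp:
\copy tmp_emp FROM 'employees.csv' WITH (FORMAT csv)
```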
Aside: A column "age" is almost always a design error, and should be something like "birthday" instead.
Upvotes: 3