Reputation: 2468
I'm currently working on a project collecting a very large amount of data from a network of wireless modems out in the field. We have a table 'readings' that looks like this:
CREATE TABLE public.readings (
    id INTEGER PRIMARY KEY NOT NULL DEFAULT nextval('readings_id_seq'::regclass),
    created TIMESTAMP WITHOUT TIME ZONE NOT NULL DEFAULT now(),
    timestamp TIMESTAMP WITHOUT TIME ZONE NOT NULL,
    modem_serial CHARACTER VARYING(255) NOT NULL,
    channel1 INTEGER NOT NULL,
    channel2 INTEGER NOT NULL,
    signal_strength INTEGER,
    battery INTEGER,
    excluded BOOLEAN NOT NULL DEFAULT false
);
CREATE UNIQUE INDEX _timestamp_modemserial_uc ON readings USING BTREE (timestamp, modem_serial);
CREATE INDEX ix_readings_timestamp ON readings USING BTREE (timestamp);
CREATE INDEX ix_readings_modem_serial ON readings USING BTREE (modem_serial);
It's important for the integrity of the system that we never have two readings from the same modem with the same timestamp, hence the unique index.
Our challenge at the moment is to find a performant way of inserting readings. We often have to insert millions of rows as we bring in historical data, and when adding to an existing base of 100 million plus readings, this can get kind of slow.
Our current approach is to import batches of 10,000 readings into a temporary_readings table, which is essentially an unindexed copy of readings. We then run the following SQL to merge it into the main table and remove duplicates:
INSERT INTO readings (created, timestamp, modem_serial, channel1, channel2, signal_strength, battery)
SELECT DISTINCT ON (timestamp, modem_serial)
       created, timestamp, modem_serial, channel1, channel2, signal_strength, battery
FROM temporary_readings
WHERE NOT EXISTS (
    SELECT * FROM readings
    WHERE readings.timestamp = temporary_readings.timestamp
      AND readings.modem_serial = temporary_readings.modem_serial
)
ORDER BY timestamp, modem_serial ASC;
This works well, but takes ~20 seconds per 10,000-row block to insert. My question is twofold:
1. Is this a reasonable approach to merging in new readings while enforcing the uniqueness constraint?
2. Is there anything we can do to speed these inserts up?
Thanks in advance!
Upvotes: 2
Views: 1054
Reputation: 53734
Adding to a table that already contains 100 million indexed records will be slow no matter what! You can probably speed things up somewhat by taking a fresh look at your indexes.
CREATE UNIQUE INDEX _timestamp_modemserial_uc ON readings USING BTREE (timestamp, modem_serial);
CREATE INDEX ix_readings_timestamp ON readings USING BTREE (timestamp);
CREATE INDEX ix_readings_modem_serial ON readings USING BTREE (modem_serial);
At the moment you have three indexes, but all of them cover the same two columns. Can't you manage with just the unique index?
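If your other queries allow it, a minimal sketch of that clean-up could look like this (assuming nothing else depends specifically on the two single-column indexes):

-- The unique index on (timestamp, modem_serial) already serves lookups that
-- lead with timestamp, so the single-column indexes may be redundant.
DROP INDEX ix_readings_timestamp;
DROP INDEX ix_readings_modem_serial;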
I don't know what your other queries are like, but your WHERE NOT EXISTS query can make use of this unique index.

If you have queries where the WHERE clause only filters on the modem_serial field, your unique index is unlikely to be used. However, if you flip the columns in that index, it will be!
CREATE UNIQUE INDEX _modemserial_timestamp_uc ON readings USING BTREE (modem_serial, timestamp);
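For example, a lookup like the following (the serial value is made up) could then use the flipped unique index:

-- modem_serial is now the leading column, so a filter on it alone can use the index.
SELECT * FROM readings WHERE modem_serial = 'ABC123';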
To quote from the manual:
A multicolumn B-tree index can be used with query conditions that involve any subset of the index's columns, but the index is most efficient when there are constraints on the leading (leftmost) columns.
The order of the columns in the index matters.
Upvotes: 1
Reputation: 1269873
Your query idea is okay. I would try timing it for 100,000 rows in the batch, to start to get an idea of an optimal batch size.
However, the DISTINCT ON is slowing things down. Here are two ideas.

The first is to assume that duplicates in batches are quite rare. If this is true, try inserting the data without the DISTINCT ON. If that fails, then run the code again with the DISTINCT ON. This complicates the insertion logic, but it might make the average insertion much shorter.
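One possible shape for that retry is a single plpgsql DO block; the answer only describes re-running the statement from the application, so this is just an illustration:

DO $$
BEGIN
    -- Optimistic path: no DISTINCT ON, assume the batch has no internal duplicates.
    INSERT INTO readings (created, timestamp, modem_serial, channel1, channel2, signal_strength, battery)
    SELECT created, timestamp, modem_serial, channel1, channel2, signal_strength, battery
    FROM temporary_readings
    WHERE NOT EXISTS (
        SELECT 1 FROM readings
        WHERE readings.timestamp = temporary_readings.timestamp
          AND readings.modem_serial = temporary_readings.modem_serial
    );
EXCEPTION WHEN unique_violation THEN
    -- The batch contained duplicates: the insert above is rolled back,
    -- so retry with DISTINCT ON to deduplicate within the batch.
    INSERT INTO readings (created, timestamp, modem_serial, channel1, channel2, signal_strength, battery)
    SELECT DISTINCT ON (timestamp, modem_serial)
           created, timestamp, modem_serial, channel1, channel2, signal_strength, battery
    FROM temporary_readings
    WHERE NOT EXISTS (
        SELECT 1 FROM readings
        WHERE readings.timestamp = temporary_readings.timestamp
          AND readings.modem_serial = temporary_readings.modem_serial
    )
    ORDER BY timestamp, modem_serial;
END $$;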
The second is to build an index on temporary_readings(timestamp, modem_serial) (not a unique index). Postgres will take advantage of this index for the insertion logic -- and sometimes building an index and using it is faster than alternative execution plans. If this does work, you might try larger batch sizes.
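For instance (the index name here is just illustrative):

-- A plain (non-unique) index on the batch table to support the merge query;
-- drop or rebuild it whenever temporary_readings is reloaded.
CREATE INDEX temporary_readings_ts_serial_idx
    ON temporary_readings (timestamp, modem_serial);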
There is a third solution, which is to use ON CONFLICT. That would allow the insertion itself to ignore duplicate values. This is only available in Postgres 9.5 and later, though.
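On 9.5+ the merge could then be reduced to something like this (a sketch, not tested against the poster's schema):

-- ON CONFLICT DO NOTHING silently skips any row that would violate the
-- unique index, covering duplicates both within the batch and against
-- existing rows, so the NOT EXISTS subquery and DISTINCT ON go away.
INSERT INTO readings (created, timestamp, modem_serial, channel1, channel2, signal_strength, battery)
SELECT created, timestamp, modem_serial, channel1, channel2, signal_strength, battery
FROM temporary_readings
ON CONFLICT (timestamp, modem_serial) DO NOTHING;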
Upvotes: 3