Reputation: 2468
I'm currently working on a project collecting a very large amount of data from a network of wireless modems out in the field. We have a table 'readings' that looks like this:
CREATE TABLE public.readings (
    id INTEGER PRIMARY KEY NOT NULL DEFAULT nextval('readings_id_seq'::regclass),
    created TIMESTAMP WITHOUT TIME ZONE NOT NULL DEFAULT now(),
    timestamp TIMESTAMP WITHOUT TIME ZONE NOT NULL,
    modem_serial CHARACTER VARYING(255) NOT NULL,
    channel1 INTEGER NOT NULL,
    channel2 INTEGER NOT NULL,
    signal_strength INTEGER,
    battery INTEGER,
    excluded BOOLEAN NOT NULL DEFAULT false
);
CREATE UNIQUE INDEX _timestamp_modemserial_uc ON readings USING BTREE (timestamp, modem_serial);
CREATE INDEX ix_readings_timestamp ON readings USING BTREE (timestamp);
CREATE INDEX ix_readings_modem_serial ON readings USING BTREE (modem_serial);
It's important for the integrity of the system that we never have two readings from the same modem with the same timestamp, hence the unique index.
Our challenge at the moment is to find a performant way of inserting readings. We often have to insert millions of rows as we bring in historical data, and when adding to an existing base of 100 million plus readings, this can get kind of slow.
Our current approach is to import batches of 10,000 readings into a temporary_readings table, which is essentially an unindexed copy of readings. We then run the following SQL to merge it into the main table and remove duplicates:
INSERT INTO readings (created, timestamp, modem_serial, channel1, channel2, signal_strength, battery)
SELECT DISTINCT ON (timestamp, modem_serial)
       created, timestamp, modem_serial, channel1, channel2, signal_strength, battery
FROM temporary_readings
WHERE NOT EXISTS (
    SELECT * FROM readings
    WHERE readings.timestamp = temporary_readings.timestamp
      AND readings.modem_serial = temporary_readings.modem_serial
)
ORDER BY timestamp, modem_serial ASC;
This works well, but takes ~20 seconds per 10,000-row block to insert. My question is twofold:
1. Is this a reasonable approach to merging in new readings while enforcing the uniqueness constraint?
2. Is there anything we can do to speed these inserts up?
Thanks in advance!
Upvotes: 2
Views: 1054
Reputation: 53734
Adding to a table that already contains 100 million indexed records will be slow no matter what! You can probably speed things up somewhat by taking a fresh look at your indexes.
CREATE UNIQUE INDEX _timestamp_modemserial_uc ON readings USING BTREE (timestamp, modem_serial);
CREATE INDEX ix_readings_timestamp ON readings USING BTREE (timestamp);
CREATE INDEX ix_readings_modem_serial ON readings USING BTREE (modem_serial);
At the moment you have three indexes, but all of them cover the same two columns. Can't you manage with just the unique index?
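If your other queries allow it, a minimal sketch of that clean-up could look like this (assuming nothing else depends specifically on the two single-column indexes):

-- The unique index on (timestamp, modem_serial) already serves lookups that
-- lead with timestamp, so the single-column indexes may be redundant.
DROP INDEX ix_readings_timestamp;
DROP INDEX ix_readings_modem_serial;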
I don't know what your other queries are like, but your WHERE NOT EXISTS query can make use of this unique index.

If you have queries where the WHERE clause only filters on the modem_serial field, your unique index is unlikely to be used. However, if you flip the columns in that index, it will be!
CREATE UNIQUE INDEX _modemserial_timestamp_uc ON readings USING BTREE (modem_serial, timestamp);
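For example, a lookup like the following (the serial value is made up) could then use the flipped unique index:

-- modem_serial is now the leading column, so a filter on it alone can use the index.
SELECT * FROM readings WHERE modem_serial = 'ABC123';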
To quote from the manual:
A multicolumn B-tree index can be used with query conditions that involve any subset of the index's columns, but the index is most efficient when there are constraints on the leading (leftmost) columns.
The order of the columns in the index matters.
Upvotes: 1
Reputation: 1269873
Your query idea is okay. I would try timing it for 100,000 rows in the batch, to start to get an idea of an optimal batch size.
However, the DISTINCT ON is slowing things down. Here are two ideas.

The first is to assume that duplicates in batches are quite rare. If this is true, try inserting the data without the DISTINCT ON. If that fails, then run the code again with the DISTINCT ON. This complicates the insertion logic, but it might make the average insertion much shorter.
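One possible shape for that retry is a single plpgsql DO block; the answer only describes re-running the statement from the application, so this is just an illustration:

DO $$
BEGIN
    -- Optimistic path: no DISTINCT ON, assume the batch has no internal duplicates.
    INSERT INTO readings (created, timestamp, modem_serial, channel1, channel2, signal_strength, battery)
    SELECT created, timestamp, modem_serial, channel1, channel2, signal_strength, battery
    FROM temporary_readings
    WHERE NOT EXISTS (
        SELECT 1 FROM readings
        WHERE readings.timestamp = temporary_readings.timestamp
          AND readings.modem_serial = temporary_readings.modem_serial
    );
EXCEPTION WHEN unique_violation THEN
    -- The batch contained duplicates: the insert above is rolled back,
    -- so retry with DISTINCT ON to deduplicate within the batch.
    INSERT INTO readings (created, timestamp, modem_serial, channel1, channel2, signal_strength, battery)
    SELECT DISTINCT ON (timestamp, modem_serial)
           created, timestamp, modem_serial, channel1, channel2, signal_strength, battery
    FROM temporary_readings
    WHERE NOT EXISTS (
        SELECT 1 FROM readings
        WHERE readings.timestamp = temporary_readings.timestamp
          AND readings.modem_serial = temporary_readings.modem_serial
    )
    ORDER BY timestamp, modem_serial;
END $$;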
The second is to build an index on temporary_readings(timestamp, modem_serial) (not a unique index). Postgres will take advantage of this index for the insertion logic -- and sometimes building an index and using it is faster than alternative execution plans. If this does work, you might try larger batch sizes.
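For instance (the index name here is just illustrative):

-- A plain (non-unique) index on the batch table to support the merge query;
-- drop or rebuild it whenever temporary_readings is reloaded.
CREATE INDEX temporary_readings_ts_serial_idx
    ON temporary_readings (timestamp, modem_serial);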
There is a third solution, which is to use ON CONFLICT. That would allow the insertion itself to ignore duplicate values. This is only available in Postgres 9.5 and later, though.
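On 9.5+ the merge could then be reduced to something like this (a sketch, not tested against the poster's schema):

-- ON CONFLICT DO NOTHING silently skips any row that would violate the
-- unique index, covering duplicates both within the batch and against
-- existing rows, so the NOT EXISTS subquery and DISTINCT ON go away.
INSERT INTO readings (created, timestamp, modem_serial, channel1, channel2, signal_strength, battery)
SELECT created, timestamp, modem_serial, channel1, channel2, signal_strength, battery
FROM temporary_readings
ON CONFLICT (timestamp, modem_serial) DO NOTHING;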
Upvotes: 3