Reputation: 140589
Given this database schema:
CREATE TABLE a (
datum INTEGER NOT NULL REFERENCES datum_details(id),
processed_datum INTEGER REFERENCES datum_details(id),
-- other columns
PRIMARY KEY(datum)
);
CREATE INDEX a__processed ON a(processed_datum);
CREATE TABLE b (
locale TEXT NOT NULL,
datum INTEGER NOT NULL REFERENCES datum_details(id),
-- other columns
PRIMARY KEY (locale, datum)
);
CREATE INDEX b__datum ON b(datum);
I want to do this:
INSERT INTO b (locale, datum)
SELECT locale, datum FROM
(SELECT DISTINCT processed_datum FROM a
WHERE processed_datum IS NOT NULL) x(datum),
-- array on next line is an example value for a query parameter
(SELECT UNNEST(ARRAY['us', 'uk', 'cn', 'jp'])) y(locale)
EXCEPT
SELECT locale, datum FROM b;
There are 5,532,592 rows in table a with non-null processed_datum, and there are 10 to 20 entries in the array provided to UNNEST. The real version of the above INSERT ran for nearly a full hour.
Surely there is a faster way? Fifty million 2-tuples should fit entirely in RAM and should not take more than a few seconds to sort or perform a set difference on.
Here's an EXPLAIN dump, in case that helps:
Insert on b (cost=65284579.58..68540566.70 rows=144531000 width=36)
-> Subquery Scan on "*SELECT*" (cost=65284579.58..68540566.70 rows=144531000 width=36)
-> SetOp Except (cost=65284579.58..67095256.70 rows=144531000 width=24)
-> Sort (cost=65284579.58..65888138.62 rows=241423616 width=24)
Sort Key: "*SELECT* 1".locale, "*SELECT* 1".datum
-> Append (cost=1187015.87..6914612.59 rows=241423616 width=24)
-> Subquery Scan on "*SELECT* 1" (cost=1187015.87..4488192.27 rows=144531000 width=36)
-> Nested Loop (cost=1187015.87..3042882.27 rows=144531000 width=36)
-> Unique (cost=1187015.87..1221789.91 rows=1445310 width=4)
-> Sort (cost=1187015.87..1204402.89 rows=6954808 width=4)
Sort Key: a.processed_datum
-> Seq Scan on a (cost=0.00..111352.57 rows=6954808 width=4)
Filter: (processed_datum IS NOT NULL)
-> Materialize (cost=0.00..2.01 rows=100 width=32)
-> Result (cost=0.00..0.51 rows=100 width=0)
-> Subquery Scan on "*SELECT* 2" (cost=0.00..2426420.32 rows=96892616 width=7)
-> Seq Scan on b b_1 (cost=0.00..1457494.16 rows=96892616 width=7)
EDIT: After bumping work_mem, shared_buffers, and effective_cache_size to 512MB, 2GB, and 4GB respectively, here's an EXPLAIN ANALYZE:
Insert on b (cost=44887364.38..48178884.98 rows=146154400 width=36) (actual time=3914166.825..3914166.825 rows=0 loops=1)
-> Subquery Scan on "*SELECT*" (cost=44887364.38..48178884.98 rows=146154400 width=36) (actual time=3685048.686..3788573.419 rows=5693368 loops=1)
-> SetOp Except (cost=44887364.38..46717340.98 rows=146154400 width=24) (actual time=3685048.682..3774876.094 rows=5693368 loops=1)
-> Sort (cost=44887364.38..45497356.58 rows=243996880 width=24) (actual time=3217525.970..3615432.663 rows=127088841 loops=1)
Sort Key: "*SELECT* 1".locale, "*SELECT* 1".datum
Sort Method: external merge Disk: 2733064kB
-> Append (cost=128839.10..5891963.33 rows=243996880 width=24) (actual time=20301.567..785480.666 rows=127088841 loops=1)
-> Subquery Scan on "*SELECT* 1" (cost=128839.10..3446545.73 rows=146154400 width=36) (actual time=20301.565..166376.863 rows=22130368 loops=1)
-> Nested Loop (cost=128839.10..1985001.73 rows=146154400 width=36) (actual time=20301.561..119491.552 rows=22130368 loops=1)
-> HashAggregate (cost=128839.10..143454.54 rows=1461544 width=4) (actual time=20301.492..30209.211 rows=5532592 loops=1)
-> Seq Scan on a (cost=0.00..111421.99 rows=6966842 width=4) (actual time=0.047..8712.474 rows=6963036 loops=1)
Filter: (processed_datum IS NOT NULL)
Rows Removed by Filter: 46118
-> Materialize (cost=0.00..2.01 rows=100 width=32) (actual time=0.001..0.005 rows=4 loops=5532592)
-> Result (cost=0.00..0.51 rows=100 width=0) (actual time=0.044..0.052 rows=4 loops=1)
-> Subquery Scan on "*SELECT* 2" (cost=0.00..2445417.60 rows=97842480 width=7) (actual time=93.293..356716.139 rows=104958473 loops=1)
-> Seq Scan on b b_1 (cost=0.00..1466992.80 rows=97842480 width=7) (actual time=93.288..134378.648 rows=104958473 loops=1)
Trigger for constraint b_datum_fkey: time=112020.669 calls=5693368
Trigger for constraint b_processed_fkey: time=12847.165 calls=5693368
Trigger for constraint b_detail_fkey: time=13007.502 calls=5693368
Total runtime: 4072066.380 ms
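(For reference, a sketch of how the per-session settings named in this edit can be applied; the post does not say exactly how they were changed, and the values are the ones quoted above. shared_buffers cannot be set this way: it has to go in postgresql.conf and needs a server restart.)
-- session-level equivalents of two of the three settings described above
SET work_mem = '512MB';             -- per-sort / per-hash memory budget
SET effective_cache_size = '4GB';   -- planner hint only; allocates no memory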
Total elapsed time still over an hour. I don't 100% understand what these numbers mean, but it looks like the bulk of the cost is in the top-level Sort operation, and this:
Sort Method: external merge Disk: 2733064kB
... doesn't make any sense at all. 50 million 2-tuples, each consisting of a four-byte integer and a two-to-four-character string, should take on the order of 400 megabytes to serialize onto disk for sorting purposes. I can imagine where there might be a factor of two overhead, but not seven.
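(A back-of-the-envelope check of that figure, using the row count from the question and assuming 10 locales; the byte counts are naive column widths and ignore PostgreSQL's per-tuple sort overhead.)
SELECT pg_size_pretty(
         5532592::bigint   -- distinct non-null processed_datum values
         * 10              -- low end of the 10-to-20-entry locale array
         * (4 + 4)         -- 4-byte integer plus a short text value
       ) AS naive_payload; -- roughly 422 MB
Note, though, that the Sort node in the plan above is fed by the Append of both EXCEPT branches (127,088,841 rows in the actual output), not just the candidate tuples, which goes a long way toward explaining the 2.7 GB of spill.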
EDIT.2: Here's EXPLAIN (ANALYZE, BUFFERS) for the "anti join" technique suggested by Clockwork-Muse. It's faster, but only by about 27%, which is not nearly enough to call this solved ("solved" would be a construct that runs in less than a minute at this scale; ideally, less than a second). Note that since zero rows were inserted, this falsifies Clodoaldo Neto's suggestion that the bottleneck is index updates. It looks to me like it's still spending most of its time in external sorts and merges: the two external Sort nodes below report roughly 546 and 1,978 seconds respectively, out of a 2,949-second total runtime.
Insert on b (cost=42686194.35..45105977.45 rows=115655360 width=36) (actual time=2948649.994..2948649.994 rows=0 loops=1)
Buffers: shared hit=103168 read=451882, temp read=277204 written=289345
-> Merge Anti Join (cost=42686194.35..45105977.45 rows=115655360 width=36) (actual time=2948649.984..2948649.984 rows=0 loops=1)
Merge Cond: (((unnest('{us,uk,cn,jp}'::text[])) = exc.locale) AND (a.processed_datum = exc.datum))
Buffers: shared hit=103168 read=451882, temp read=277204 written=289345
-> Sort (cost=25512243.57..25873666.57 rows=144569200 width=36) (actual time=494618.787..546275.100 rows=22130368 loops=1)
Sort Key: (unnest('{us,uk,cn,jp}'::text[])), a.processed_datum
Sort Method: external merge Disk: 367752kB
Buffers: shared hit=4 read=41286, temp read=45973 written=45973
-> Nested Loop (cost=128828.23..1964858.83 rows=144569200 width=36) (actual time=21022.002..122392.505 rows=22130368 loops=1)
Buffers: shared read=41286
-> HashAggregate (cost=128828.23..143285.15 rows=1445692 width=4) (actual time=21021.642..30776.200 rows=5532592 loops=1)
Buffers: shared read=41286
-> Seq Scan on a (cost=0.00..111403.46 rows=6969909 width=4) (actual time=0.057..8959.278 rows=6963036 loops=1)
Filter: (processed_datum IS NOT NULL)
Rows Removed by Filter: 46118
Buffers: shared read=41286
-> Materialize (cost=0.00..2.01 rows=100 width=32) (actual time=0.001..0.005 rows=4 loops=5532592)
-> Result (cost=0.00..0.51 rows=100 width=0) (actual time=0.338..0.344 rows=4 loops=1)
-> Materialize (cost=17173950.78..17704603.74 rows=106130592 width=7) (actual time=1769582.325..2204210.714 rows=105119249 loops=1)
Buffers: shared hit=103164 read=410596, temp read=231231 written=243372
-> Sort (cost=17173950.78..17439277.26 rows=106130592 width=7) (actual time=1769582.298..1978474.757 rows=105119249 loops=1)
Sort Key: exc.locale, exc.datum
Sort Method: external merge Disk: 1946944kB
Buffers: shared hit=103164 read=410596, temp read=231231 written=243372
-> Seq Scan on b exc (cost=0.00..1575065.92 rows=106130592 width=7) (actual time=91.383..158841.821 rows=110651841 loops=1)
Buffers: shared hit=103164 read=410596
Total runtime: 2949399.213 ms
Upvotes: 0
Views: 111
Reputation: 4503
CREATE TABLE a
( datum INTEGER NOT NULL -- REFERENCES datum_details(id)
, processed_datum INTEGER -- REFERENCES datum_details(id)
-- other columns
, PRIMARY KEY(datum)
);
CREATE INDEX a__processed ON a(processed_datum);
CREATE TABLE b (
locale TEXT NOT NULL
, datum INTEGER NOT NULL -- REFERENCES datum_details(id)
-- other columns
, PRIMARY KEY (locale, datum)
);
-- this index might help more than the original single-field index
CREATE UNIQUE INDEX b__datum ON b(datum, locale);
-- some test data
INSERT INTO a(datum, processed_datum)
SELECT gs, gs * 5
FROM generate_series(1,1000) gs
;
UPDATE a SET processed_datum = NULL WHERE random() < 0.5 ;
-- some test data
INSERT INTO b (datum,locale)
SELECT gs, 'us'
FROM generate_series(1,1000) gs
;
UPDATE b SET locale = 'uk' WHERE random() < 0.5;
UPDATE b SET locale = 'cn' WHERE random() < 0.4;
UPDATE b SET locale = 'jp' WHERE random() < 0.3;
VACUUM ANALYZE a;
VACUUM ANALYZE b;
-- EXPLAIN ANALYZE
INSERT INTO b (locale, datum)
SELECT y.locale, x.datum FROM
(SELECT DISTINCT processed_datum AS datum
FROM a
WHERE a.processed_datum IS NOT NULL
) AS x
, (SELECT UNNEST(ARRAY['us', 'uk', 'cn', 'jp']) AS locale
) AS y
WHERE NOT EXISTS (
SELECT * FROM b nx
WHERE nx.locale = y.locale
AND nx.datum = x.datum
)
;
Note: except for the index, this is very similar to @Clockwork-Muse's answer.
Upvotes: 0
Reputation: 13056
You might be able to do better with an exception join, essentially an anti join (DB2 has this as an actual join type; everybody else has to write it out explicitly, as in the forms below):
INSERT INTO b (locale, datum)
SELECT y.locale, x.datum
FROM (SELECT DISTINCT processed_datum
FROM a
WHERE processed_datum IS NOT NULL) x(datum)
CROSS JOIN (SELECT UNNEST(ARRAY['us', 'uk', 'cn', 'jp'])) y(locale)
LEFT JOIN b Exception
ON Exception.locale = y.locale
AND Exception.datum = x.datum
WHERE Exception.locale IS NULL
Or perhaps a related form:
INSERT INTO b (locale, datum)
SELECT locale, datum
FROM (SELECT DISTINCT processed_datum
FROM a
WHERE processed_datum IS NOT NULL) x(datum)
CROSS JOIN (SELECT UNNEST(ARRAY['us', 'uk', 'cn', 'jp'])) y(locale)
WHERE NOT EXISTS(SELECT 1
FROM b Exception
WHERE Exception.locale = y.locale
AND Exception.datum = x.datum)
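Either way, the aim is for the planner to choose an anti join instead of the SetOp-plus-Sort plan shown in the question. One way to check which strategy it picks, without running the full insert, is to EXPLAIN just the SELECT part (a sketch based on the second form above):
EXPLAIN
SELECT y.locale, x.datum
FROM (SELECT DISTINCT processed_datum
      FROM a
      WHERE processed_datum IS NOT NULL) x(datum)
CROSS JOIN (SELECT UNNEST(ARRAY['us', 'uk', 'cn', 'jp'])) y(locale)
WHERE NOT EXISTS (SELECT 1
                  FROM b exc
                  WHERE exc.locale = y.locale
                    AND exc.datum = x.datum);
A "Hash Anti Join" or "Merge Anti Join" node in the output confirms the rewrite took effect.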
Upvotes: 2
Reputation: 125244
The not exists variant suggested by Clockwork-Muse frequently outperforms the others. But I guess the bulk of the cost lies in the primary key and b__datum indexes being updated concurrently with the insertion. It is cheaper to build the indexes in one pass afterwards. Try dropping the primary key and the index before the insertion:
alter table b drop constraint b_pkey;
drop index b__datum;
Then do the insert, and build the indexes only after that:
alter table b add primary key (locale, datum);
create index b__datum on b(datum);
Do it inside a transaction so you can roll back if it fails with a duplicate primary key:
begin;
alter table b drop constraint b_pkey;
drop index b__datum;
insert ...
alter table b add primary key (locale, datum);
create index b__datum on b(datum);
commit; -- or rollback if it fails
Upvotes: 1