Alexander Palamarchuk
Alexander Palamarchuk

Reputation: 879

Sharp drop in performance of SELECT .. WHERE NOT IN (SELECT ..) with strings

THE PROBLEM:

SELECT new_filename
FROM tmp2_import_lightnings_filenames
WHERE new_filename
    NOT IN (SELECT filename FROM service.import_lightnings_filenames LIMIT 64500)
LIMIT 1;

Execution time: 62 ms.

SELECT new_filename
FROM tmp2_import_lightnings_filenames
WHERE new_filename
    NOT IN (SELECT filename FROM service.import_lightnings_filenames LIMIT 65000)
LIMIT 1;

Execution time: 4.742 sec.

(All LIMITS are only for test).

Enormous lag! And it increases exponentially.

TABLES:

CREATE TABLE public.tmp2_import_lightnings_filenames (
  new_filename VARCHAR(63) NOT NULL, 
  CONSTRAINT tmp2_import_lightnings_filenames_pkey PRIMARY KEY(new_filename)
) WITHOUT OIDS;

Table size: 7304 strings

Data example: /xml/2012-07-13/01/01-24.xml

CREATE TABLE service.import_lightnings_filenames (
  id SERIAL, 
  filename VARCHAR(63) NOT NULL, 
  imported BOOLEAN DEFAULT false, 
  strokes_num INTEGER, 
  CONSTRAINT import_lightnings_filenames_pkey PRIMARY KEY(id)
) WITHOUT OIDS;

CREATE UNIQUE INDEX import_lightnings_filenames_idx
ON service.import_lightnings_filenames
USING btree (filename COLLATE pg_catalog."default");

Table size: 70812 strings

Data example: 44;/xml/2012-05-26/12/12-18.xml;TRUE;NULL

QUERY PLANS:

 Limit  (cost=0.00..2108.11 rows=1 width=29) (actual time=240.183..240.183 rows=1 loops=1)
   Buffers: shared hit=539, temp written=307
   ->  Seq Scan on tmp2_import_lightnings_filenames  (cost=0.00..7698823.12 rows=3652 width=29) (actual time=240.181..240.181 rows=1 loops=1)
         Filter: (NOT (SubPlan 1))
         Buffers: shared hit=539, temp written=307
         SubPlan 1
           ->  Materialize  (cost=0.00..1946.82 rows=64500 width=29) (actual time=0.009..198.313 rows=64500 loops=1)
                 Buffers: shared hit=538, temp written=307
                 ->  Limit  (cost=0.00..1183.32 rows=64500 width=29) (actual time=0.005..113.196 rows=64500 loops=1)
                       Buffers: shared hit=538
                       ->  Seq Scan on import_lightnings_filenames  (cost=0.00..1299.12 rows=70812 width=29) (actual time=0.004..42.418 rows=64500 loops=1)
                             Buffers: shared hit=538
 Total runtime: 240.982 ms



  Limit  (cost=0.00..2125.03 rows=1 width=29) (actual time=30734.619..30734.619 rows=1 loops=1)
   Buffers: shared hit=547, temp read=112258 written=669
   ->  Seq Scan on tmp2_import_lightnings_filenames  (cost=0.00..7760626.00 rows=3652 width=29) (actual time=30734.617..30734.617 rows=1 loops=1)
         Filter: (NOT (SubPlan 1))
         Buffers: shared hit=547, temp read=112258 written=669
         SubPlan 1
           ->  Materialize  (cost=0.00..1962.49 rows=65000 width=29) (actual time=0.798..42.306 rows=64820 loops=363)
                 Buffers: shared hit=543, temp read=112258 written=669
                 ->  Limit  (cost=0.00..1192.49 rows=65000 width=29) (actual time=0.005..116.110 rows=65000 loops=1)
                       Buffers: shared hit=543
                       ->  Seq Scan on import_lightnings_filenames  (cost=0.00..1299.12 rows=70812 width=29) (actual time=0.003..43.804 rows=65000 loops=1)
                             Buffers: shared hit=543
 Total runtime: 30735.267 ms

What do I do wrong?

Upvotes: 5

Views: 843

Answers (2)

wildplasser
wildplasser

Reputation: 44250

-- SET work_mem=20000;
SET random_page_cost=1.1;
SET effective_cache_size=10000000;

Setting work_mem to 1--20 MB will prefer hash tables (as long as they fit in core) This is effective for small to moderate sized queries.

Setting random_page_cost lower will cause the query generator to prefer index scans once they are needed. This is the tripping point between the OP's first and second query. (but there, the index-scan stage is skipped, in favor of a seqscan) The default value( =4) is way too high)

effective_cache_size is an estimate of the amount of LRU buffering that the OS maintains. Set this as high as possible (without causing swapping)

Upvotes: 0

Erwin Brandstetter
Erwin Brandstetter

Reputation: 656804

The cause of the performance drop seems to be that you run out of work_mem and the materialize step starts swapping out to disk. I quote the manual here:

work_mem (integer)
[...] Hash tables are used in hash joins, hash-based aggregation, and hash-based processing of IN subqueries.

Emphasis mine. Verify this by raising the setting for work_mem and running your queries once more. As @a_horse supplied in the comment, set it for the current session by calling:

set work_mem = '64MB';

You don't need your sysadmin for that. You can reset to the default within the session:

reset work_mem;

Setting will be gone with the end of the session. Change the setting in postgresql.conf (and reload) for a permanent effect.

Many PostgreSQL packages are shipped with very conservative settings (default 1MB). It very much depends on your workload, but 16 MB on a machine with 4 GB or more would be a minimum in general. I use 64 MB on a dedicated db server with 12 GB RAM - with only few concurrent users.

You may have to do some general tuning of your setup. Here is a list of pointers for general performance optimization in the PostgreSQL Wiki. You will also find more information on work_mem tuning following the links.


Apart from this, rewriting your query will probably speed things up, too. IN subqueries with big lists tend to be the slowest choice in PostgreSQL.

LEFT JOIN / IS NULL

SELECT new_filename
FROM   tmp2_import_lightnings_filenames t
LEFT   JOIN (
    SELECT filename
    FROM   service.import_lightnings_filenames
    LIMIT  65000
    ) x ON t.new_filename = x.filename 
WHERE  x.filename IS NULL;

NOT EXISTS

Faster, in particular, with duplicates in service.import_lightnings_filenames:

SELECT new_filename
FROM   tmp2_import_lightnings_filenames t
WHERE  NOT EXISTS (
    SELECT 1
    FROM (
        SELECT filename
        FROM   service.import_lightnings_filenames
        LIMIT  65000
        ) x
    WHERE t.new_filename = x.filename 
    );

The same with CTE (probably not faster, but easier to read):

WITH x AS (
    SELECT filename
    FROM   service.import_lightnings_filenames
    LIMIT  65000
    )
SELECT new_filename
FROM   tmp2_import_lightnings_filenames t
WHERE  NOT EXISTS (
    SELECT 1
    FROM   x
    WHERE  t.new_filename = x.filename 
    );

Upvotes: 3

Related Questions