Reputation: 879
THE PROBLEM:
SELECT new_filename
FROM tmp2_import_lightnings_filenames
WHERE new_filename
NOT IN (SELECT filename FROM service.import_lightnings_filenames LIMIT 64500)
LIMIT 1;
Execution time: 62 ms.
SELECT new_filename
FROM tmp2_import_lightnings_filenames
WHERE new_filename
NOT IN (SELECT filename FROM service.import_lightnings_filenames LIMIT 65000)
LIMIT 1;
Execution time: 4.742 sec.
(All LIMIT clauses are for testing only.)
Enormous lag! And it grows dramatically: 500 more rows in the subquery turn 62 ms into almost 5 seconds.
TABLES:
CREATE TABLE public.tmp2_import_lightnings_filenames (
new_filename VARCHAR(63) NOT NULL,
CONSTRAINT tmp2_import_lightnings_filenames_pkey PRIMARY KEY(new_filename)
) WITHOUT OIDS;
Table size: 7304 rows
Data example: /xml/2012-07-13/01/01-24.xml
CREATE TABLE service.import_lightnings_filenames (
id SERIAL,
filename VARCHAR(63) NOT NULL,
imported BOOLEAN DEFAULT false,
strokes_num INTEGER,
CONSTRAINT import_lightnings_filenames_pkey PRIMARY KEY(id)
) WITHOUT OIDS;
CREATE UNIQUE INDEX import_lightnings_filenames_idx
ON service.import_lightnings_filenames
USING btree (filename COLLATE pg_catalog."default");
Table size: 70812 rows
Data example: 44;/xml/2012-05-26/12/12-18.xml;TRUE;NULL
QUERY PLANS:
Limit (cost=0.00..2108.11 rows=1 width=29) (actual time=240.183..240.183 rows=1 loops=1)
Buffers: shared hit=539, temp written=307
-> Seq Scan on tmp2_import_lightnings_filenames (cost=0.00..7698823.12 rows=3652 width=29) (actual time=240.181..240.181 rows=1 loops=1)
Filter: (NOT (SubPlan 1))
Buffers: shared hit=539, temp written=307
SubPlan 1
-> Materialize (cost=0.00..1946.82 rows=64500 width=29) (actual time=0.009..198.313 rows=64500 loops=1)
Buffers: shared hit=538, temp written=307
-> Limit (cost=0.00..1183.32 rows=64500 width=29) (actual time=0.005..113.196 rows=64500 loops=1)
Buffers: shared hit=538
-> Seq Scan on import_lightnings_filenames (cost=0.00..1299.12 rows=70812 width=29) (actual time=0.004..42.418 rows=64500 loops=1)
Buffers: shared hit=538
Total runtime: 240.982 ms
Limit (cost=0.00..2125.03 rows=1 width=29) (actual time=30734.619..30734.619 rows=1 loops=1)
Buffers: shared hit=547, temp read=112258 written=669
-> Seq Scan on tmp2_import_lightnings_filenames (cost=0.00..7760626.00 rows=3652 width=29) (actual time=30734.617..30734.617 rows=1 loops=1)
Filter: (NOT (SubPlan 1))
Buffers: shared hit=547, temp read=112258 written=669
SubPlan 1
-> Materialize (cost=0.00..1962.49 rows=65000 width=29) (actual time=0.798..42.306 rows=64820 loops=363)
Buffers: shared hit=543, temp read=112258 written=669
-> Limit (cost=0.00..1192.49 rows=65000 width=29) (actual time=0.005..116.110 rows=65000 loops=1)
Buffers: shared hit=543
-> Seq Scan on import_lightnings_filenames (cost=0.00..1299.12 rows=70812 width=29) (actual time=0.003..43.804 rows=65000 loops=1)
Buffers: shared hit=543
Total runtime: 30735.267 ms
What am I doing wrong?
Upvotes: 5
Views: 843
Reputation: 44250
-- SET work_mem=20000;
SET random_page_cost=1.1;
SET effective_cache_size=10000000;
Setting work_mem
to 1-20 MB will make the planner prefer hash tables (as long as they fit in memory). This is effective for small to moderately sized queries.
Setting random_page_cost
lower will cause the planner to prefer index scans once they are needed. This is the tipping point between the OP's first and second query (although there the index-scan stage is skipped in favor of a seq scan). The default value (4) is far too high for most setups.
effective_cache_size
is an estimate of the amount of LRU buffering (disk cache) that the OS maintains. Set it as high as possible without causing swapping.
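The settings above can be applied per session and inspected without touching postgresql.conf; a minimal sketch using standard PostgreSQL commands (the '8GB' value is an assumption, size it to your machine's RAM):

```sql
-- Apply the suggested planner settings for the current session only.
SET work_mem = '20MB';
SET random_page_cost = 1.1;
SET effective_cache_size = '8GB';   -- assumption: adjust to available RAM

-- Inspect the effective values.
SELECT name, setting, unit
FROM pg_settings
WHERE name IN ('work_mem', 'random_page_cost', 'effective_cache_size');
```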
Upvotes: 0
Reputation: 656804
The cause of the performance drop seems to be that you run out of work_mem
and the Materialize
step starts spilling to disk. I quote the manual here:
work_mem (integer)
[...] Hash tables are used in hash joins, hash-based aggregation, and hash-based processing of IN subqueries.
Emphasis mine. Verify this by raising the setting for work_mem
and running your queries once more. As @a_horse suggested in a comment, set it for the current session by calling:
set work_mem = '64MB';
You don't need your sysadmin for that. You can reset to the default within the session:
reset work_mem;
The setting is gone at the end of the session. Change it in postgresql.conf
(and reload) for a permanent effect.
Many PostgreSQL packages ship with very conservative settings (default 1 MB). The right value very much depends on your workload, but 16 MB on a machine with 4 GB of RAM or more would generally be a minimum. I use 64 MB on a dedicated database server with 12 GB RAM and only a few concurrent users.
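For the permanent route, after editing postgresql.conf you can reload the configuration without restarting the server; a short sketch (pg_reload_conf() requires superuser privileges):

```sql
-- In postgresql.conf:  work_mem = '16MB'
-- Then reload the configuration without a restart:
SELECT pg_reload_conf();

-- Confirm the new value took effect:
SHOW work_mem;
```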
You may have to do some general tuning of your setup. Here is a list of pointers for general performance optimization in the PostgreSQL Wiki. You will also find more information on work_mem
tuning by following the links.
Apart from this, rewriting your query will probably speed things up, too. IN
subqueries with big lists tend to be the slowest choice in PostgreSQL.
SELECT new_filename
FROM tmp2_import_lightnings_filenames t
LEFT JOIN (
    SELECT filename
    FROM service.import_lightnings_filenames
    LIMIT 65000
) x ON t.new_filename = x.filename
WHERE x.filename IS NULL;
Faster, in particular, with duplicates in service.import_lightnings_filenames:
SELECT new_filename
FROM tmp2_import_lightnings_filenames t
WHERE NOT EXISTS (
    SELECT 1
    FROM (
        SELECT filename
        FROM service.import_lightnings_filenames
        LIMIT 65000
    ) x
    WHERE t.new_filename = x.filename
);
The same with CTE (probably not faster, but easier to read):
WITH x AS (
    SELECT filename
    FROM service.import_lightnings_filenames
    LIMIT 65000
)
SELECT new_filename
FROM tmp2_import_lightnings_filenames t
WHERE NOT EXISTS (
    SELECT 1
    FROM x
    WHERE t.new_filename = x.filename
);
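To verify which rewrite actually helps, comparing plans is the quickest check; a sketch using EXPLAIN (ANALYZE, BUFFERS) (available since PostgreSQL 9.0) on the NOT EXISTS form without the test LIMIT. With sufficient work_mem the plan should show a hash anti join instead of the materialized SubPlan:

```sql
EXPLAIN (ANALYZE, BUFFERS)
SELECT new_filename
FROM tmp2_import_lightnings_filenames t
WHERE NOT EXISTS (
    SELECT 1
    FROM service.import_lightnings_filenames i
    WHERE i.filename = t.new_filename
);
```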
Upvotes: 3