Tal

Reputation: 147

optimizing a nested query to avoid a tremendous performance hit

Hello, SQL wizards out there. I'm breaking my head over the odd performance of the following query (running on PostgreSQL):

WITH temp_song_id_with_all_stemmed_words AS 
    (SELECT song_id FROM stemmed_words
    WHERE stemmed_word IN ('yesterdai','troubl','seem','awai','believ')
    GROUP BY song_id
    HAVING COUNT(*)=5)
SELECT *
FROM words
WHERE song_id IN(
    SELECT song_id
    FROM temp_song_id_with_all_stemmed_words
)
ORDER BY song_id, global_position;

It takes about 10 seconds to run with the amount of data I put into the tables. I tried various approaches to optimize this query, but it was all to no avail: calculation time stays in the range of about 10 seconds (and that's assuming everything is already cached in memory; if it isn't, it can take up to a minute).
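
For illustration, one equivalent single-statement rewrite of this kind joins against the aggregating subquery instead of going through IN (just a sketch of such a variant; the alias matching_songs is arbitrary, and the planner may well pick the same plan for it):

-- same logic as the CTE version, expressed as a single join
SELECT w.*
FROM words w
JOIN (
    -- songs containing all five stemmed words
    SELECT song_id
    FROM stemmed_words
    WHERE stemmed_word IN ('yesterdai','troubl','seem','awai','believ')
    GROUP BY song_id
    HAVING COUNT(*)=5
) matching_songs USING (song_id)
ORDER BY w.song_id, w.global_position;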

Then I noticed that when I split the query into its component parts, things behave entirely differently:

SELECT song_id FROM stemmed_words
WHERE stemmed_word IN ('yesterdai','troubl','seem','awai','believ')
GROUP BY song_id
HAVING COUNT(*)=5;

This query takes about 500 ms tops, and returns 3 IDs as a result.

When I plug these results into the enclosing query:

SELECT *
FROM words
WHERE song_id IN(337409,328981,304231)
ORDER BY song_id, global_position;

It takes about 30 ms to complete.

I have no idea what goes on under the hood here, but I would imagine that a proper SQL optimizer would do just what I did above.

When I look at the EXPLAIN output, I see the following:

--UPDATE-- I ran EXPLAIN (ANALYZE, VERBOSE) instead of just EXPLAIN:

"Merge Join  (cost=20253.29..706336.00 rows=6312654 width=21) (actual time=240731.380..259453.350 rows=356 loops=1)"
"  Output: words.song_id, words.word, words.global_position, words.line_number, words.verse_number"
"  Merge Cond: (words.song_id = temp_song_id_with_all_stemmed_words.song_id)"
"  CTE temp_song_id_with_all_stemmed_words"
"    ->  HashAggregate  (cost=19799.62..19936.11 rows=13649 width=4) (actual time=43.168..44.916 rows=3 loops=1)"
"          Output: stemmed_words.song_id"
"          Group Key: stemmed_words.song_id"
"          Filter: (count(*) = 5)"
"          Rows Removed by Filter: 17181"
"          ->  Bitmap Heap Scan on public.stemmed_words  (cost=474.02..19714.55 rows=17014 width=4) (actual time=10.254..31.899 rows=21099 loops=1)"
"                Output: stemmed_words.stemmed_word, stemmed_words.song_id"
"                Recheck Cond: (stemmed_words.stemmed_word = ANY ('{yesterdai,troubl,seem,awai,believ}'::text[]))"
"                Heap Blocks: exact=12239"
"                ->  Bitmap Index Scan on stemmed_words_pkey  (cost=0.00..469.76 rows=17014 width=0) (actual time=6.052..6.052 rows=21099 loops=1)"
"                      Index Cond: (stemmed_words.stemmed_word = ANY ('{yesterdai,troubl,seem,awai,believ}'::text[]))"
"  ->  Index Scan using words_song_id_global_position_idx on public.words  (cost=0.44..653025.11 rows=12625308 width=21) (actual time=0.117..257820.366 rows=7860598 loops=1)"
"        Output: words.song_id, words.word, words.global_position, words.line_number, words.verse_number"
"  ->  Sort  (cost=316.75..317.25 rows=200 width=4) (actual time=44.953..45.017 rows=274 loops=1)"
"        Output: temp_song_id_with_all_stemmed_words.song_id"
"        Sort Key: temp_song_id_with_all_stemmed_words.song_id"
"        Sort Method: quicksort  Memory: 25kB"
"        ->  HashAggregate  (cost=307.10..309.10 rows=200 width=4) (actual time=44.928..44.929 rows=3 loops=1)"
"              Output: temp_song_id_with_all_stemmed_words.song_id"
"              Group Key: temp_song_id_with_all_stemmed_words.song_id"
"              ->  CTE Scan on temp_song_id_with_all_stemmed_words  (cost=0.00..272.98 rows=13649 width=4) (actual time=43.171..44.921 rows=3 loops=1)"
"                    Output: temp_song_id_with_all_stemmed_words.song_id"
"Planning time: 0.481 ms"
"Execution time: 259454.102 ms"

But honestly, I do not understand what is going on there... it looks like Chinese to me.

So to summarize my question: I have a feeling that I can optimize this as a single query, instead of splitting it into two separate ones.

Upvotes: 0

Views: 52

Answers (1)

Laurenz Albe

Reputation: 246878

The problem here is that PostgreSQL cannot correctly estimate the number of rows the CTE (= WITH query) will return.

PostgreSQL estimates 13649 rows, while you tell us the correct number is 3.

I'd expect good results with your second technique (putting the WITH query into a temporary table and then querying that), as long as you ANALYZE the temporary table between those two steps, because then PostgreSQL knows exactly how many values it has to deal with.
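
A minimal sketch of that approach, using the tables from your question (the temporary table name temp_song_ids is just a placeholder):

CREATE TEMPORARY TABLE temp_song_ids AS
SELECT song_id
FROM stemmed_words
WHERE stemmed_word IN ('yesterdai','troubl','seem','awai','believ')
GROUP BY song_id
HAVING COUNT(*)=5;

-- collect statistics so the planner sees the real row count (3, not 13649)
ANALYZE temp_song_ids;

SELECT *
FROM words
WHERE song_id IN (SELECT song_id FROM temp_song_ids)
ORDER BY song_id, global_position;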

Upvotes: 2
