PyPingu

Reputation: 1747

PostgreSQL chooses not to use index despite improved performance

I had a DB in MySQL and am in the process of moving to PostgreSQL with a Django front-end.

I have a table of 650k-750k rows on which I perform the following query:

SELECT "MMG", "Gene", COUNT(*) FROM at_summary_typing WHERE "MMG" != '' GROUP BY "MMG", "Gene" ORDER BY COUNT(*);

In MySQL this returns in ~0.5s. However, when I switched to PostgreSQL the same query takes ~3s. I have put an index on MMG and Gene together to try to speed it up, but EXPLAIN (ANALYZE, BUFFERS, VERBOSE) shows that the index is not used:

 Sort  (cost=59013.54..59053.36 rows=15927 width=14) (actual time=2880.222..2885.475 rows=39314 loops=1)
   Output: "MMG", "Gene", (count(*))
   Sort Key: (count(*))
   Sort Method: external merge  Disk: 3280kB
   Buffers: shared hit=16093 read=11482, temp read=2230 written=2230
   ->  GroupAggregate  (cost=55915.50..57901.90 rows=15927 width=14) (actual time=2179.809..2861.679 rows=39314 loops=1)
         Output: "MMG", "Gene", count(*)
         Buffers: shared hit=16093 read=11482, temp read=1819 written=1819
         ->  Sort  (cost=55915.50..56372.29 rows=182713 width=14) (actual time=2179.782..2830.232 rows=180657 loops=1)
               Output: "MMG", "Gene"
               Sort Key: at_summary_typing."MMG", at_summary_typing."Gene"
               Sort Method: external merge  Disk: 8168kB
               Buffers: shared hit=16093 read=11482, temp read=1819 written=1819
               ->  Seq Scan on public.at_summary_typing  (cost=0.00..36821.60 rows=182713 width=14) (actual time=0.010..224.658 rows=180657 loops=1)
                     Output: "MMG", "Gene"
                     Filter: ((at_summary_typing."MMG")::text <> ''::text)
                     Rows Removed by Filter: 559071
                     Buffers: shared hit=16093 read=11482
 Total runtime: 2888.804 ms
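
For reference, the combined index was created along these lines (a sketch; I'm assuming a default btree, and the name mm_gene_idx is taken from the plan further down):

CREATE INDEX mm_gene_idx ON at_summary_typing ("MMG", "Gene");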

After some searching I found that I could force the use of the index by running SET enable_seqscan = OFF; and EXPLAIN now shows the following:

Sort  (cost=1181591.18..1181631.00 rows=15927 width=14) (actual time=555.546..560.839 rows=39314 loops=1)
   Output: "MMG", "Gene", (count(*))
   Sort Key: (count(*))
   Sort Method: external merge  Disk: 3280kB
   Buffers: shared hit=173219 read=87094 written=7, temp read=411 written=411
   ->  GroupAggregate  (cost=0.42..1180479.54 rows=15927 width=14) (actual time=247.546..533.202 rows=39314 loops=1)
         Output: "MMG", "Gene", count(*)
         Buffers: shared hit=173219 read=87094 written=7
         ->  Index Only Scan using mm_gene_idx on public.at_summary_typing  (cost=0.42..1178949.93 rows=182713 width=14) (actual time=247.533..497.771 rows=180657 loops=1)
               Output: "MMG", "Gene"
               Filter: ((at_summary_typing."MMG")::text <> ''::text)
               Rows Removed by Filter: 559071
               Heap Fetches: 739728
               Buffers: shared hit=173219 read=87094 written=7
 Total runtime: 562.735 ms

Performance is now comparable with MySQL. The problem is that I understand setting this is bad practice, and that I should instead find a way to improve my query or encourage the use of the index automatically. However, I'm very inexperienced with PostgreSQL and cannot work out how or why it is choosing to use a Seq Scan over an Index Scan in the first place.
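
(For completeness: SET enable_seqscan = OFF only affects the current session and can be undone with RESET. A minimal sketch of the experiment, reusing the query from above:

SET enable_seqscan = OFF;
EXPLAIN (ANALYZE, BUFFERS, VERBOSE)
SELECT "MMG", "Gene", COUNT(*)
FROM at_summary_typing
WHERE "MMG" != ''
GROUP BY "MMG", "Gene"
ORDER BY COUNT(*);
RESET enable_seqscan;  -- back to the default planner behaviour
)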

Upvotes: 2

Views: 389

Answers (1)

user330315


why it is choosing to use a Seq Scan over an Index Scan in the first place

Because the seq scan is actually twice as fast as the index scan (224ms vs. 497ms), even though the index was nearly completely in the cache while the table was not.

So choosing the seq scan was the right thing to do.

The bottleneck in the first plan is the sorting and grouping that needs to be done on disk.

The better strategy is to increase work_mem to something more realistic than the rather small default of 4MB. You might want to start with something like 16MB by running

set work_mem = '16MB';

before running your query. If that doesn't remove the "Sort Method: external merge Disk" steps, increase work_mem further.

With a larger work_mem it is also possible that Postgres switches to a hash aggregate instead of the sort it currently does, which will probably be faster anyway (but that isn't feasible if not enough memory is available).

Once you find a good value, you might want to make that permanent by putting the new value into postgresql.conf.

Don't set this too high: that memory may be requested multiple times for each query.
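
If you only need the larger value for this one query, a sketch using SET LOCAL, which lasts only until the end of the transaction (the 32MB here is just an example value):

begin;
set local work_mem = '32MB';  -- reverts automatically at commit/rollback
select "MMG", "Gene", count(*)
from at_summary_typing
where "MMG" != ''
group by "MMG", "Gene"
order by count(*);
commit;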


If your WHERE condition is static, you could also create a partial index matching that criterion:

create index on at_summary_typing ("MMG", "Gene") 
where "MMG" <> '';

Don't forget to analyze the table to update the statistics.
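
For example:

analyze at_summary_typing;

Running EXPLAIN on the original query afterwards will show whether the planner picks the new partial index for the matching WHERE clause.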

Upvotes: 3
