user2945234
user2945234

Reputation: 428

Why does sqlite rescan the table with distinct?

I need to get the distinct elements of ref and alt. I have a very efficient query until I add distinct and it rescans the base table? Since I have a temp table shouldn't it simply use that as a source of data?

   sqlite> explain query plan
       ...> select t1.ref, t1.alt from (SELECT * from Sample_szes where str_id 
        = 'STR_832206') as t1;


 selectid|order|from|detail
    1|0|0|SEARCH TABLE vcfBase AS base USING INDEX vcfBase_strid_idx ( . 
     str_id=?) (~10 rows)
     1|1|1|SEARCH TABLE vcfhomozyg AS hzyg USING INDEX homozyg_strid_idx 
     (str_id=?) (~10 rows)
      2|0|0|SEARCH TABLE vcfBase AS base USING INDEX vcfBase_strid_idx 
      (str_id=?) (~10 rows)
     2|1|1|SEARCH TABLE vcfAlt AS alt USING INDEX vcfAlt_strid_idx 
    (str_id=?) (~2 rows)
    2|2|2|SEARCH TABLE altGT AS gt USING INDEX altGT_strid_idx (str_id=?) (~2 rows)
    0|0|0|COMPOUND SUBQUERIES 1 AND 2 (UNION ALL)

Add distinct and it rescans the large base table.

sqlite> explain query plan
 ...> select distinct t1.ref, t1.alt from (SELECT * from Sample_szes 
 where str_id = 'STR_832206') as t1;

selectid|order|from|detail
2|0|0|SCAN TABLE vcfBase AS base (~1000000 rows)
2|1|1|SEARCH TABLE vcfhomozyg AS hzyg USING INDEX homozyg_strid_idx 
(str_id=?) (~10 rows)
3|0|0|SCAN TABLE vcfBase AS base (~1000000 rows)
3|1|1|SEARCH TABLE vcfAlt AS alt USING INDEX vcfAlt_strid_idx (str_id=?) (~2 rows)
3|2|2|SEARCH TABLE altGT AS gt USING INDEX altGT_strid_idx (str_id=?) (~2 rows)
1|0|0|COMPOUND SUBQUERIES 2 AND 3 (UNION ALL)
0|0|0|SCAN SUBQUERY 1 (~1400000 rows)
0|0|0|USE TEMP B-TREE FOR DISTINCT

Upvotes: 0

Views: 81

Answers (2)

MikeT
MikeT

Reputation: 56938

You should create a composite index for the ref and alt columns. The index would then be used. Otherwise the temporary B-TREE (index) is created which requires an entire scan to sort the data for the index.

I believe the explanation is as per :-

If a SELECT query contains an ORDER BY, GROUP BY or DISTINCT clause, SQLite may need to use a temporary b-tree structure to sort the output rows. Or, it might use an index. Using an index is almost always much more efficient than performing a sort.

If a temporary b-tree is required, a record is added to the EXPLAIN QUERY PLAN output with the "detail" field set to a string value of the form "USE TEMP B-TREE FOR xxx", where xxx is one of "ORDER BY", "GROUP BY" or "DISTINCT". For example:

sqlite> EXPLAIN QUERY PLAN SELECT c, d FROM t2 ORDER BY c;
QUERY PLAN
|--SCAN TABLE t2
`--USE TEMP B-TREE FOR ORDER BY

In this case using the temporary b-tree can be avoided by creating an index on t2(c), as follows:

sqlite> CREATE INDEX i4 ON t2(c);
sqlite> EXPLAIN QUERY PLAN SELECT c, d FROM t2 ORDER BY c; 
QUERY PLAN
`--SCAN TABLE t2 USING INDEX i4

EXPLAIN QUERY PLAN - 1.2. Temporary Sorting B-Trees

Upvotes: 1

user2945234
user2945234

Reputation: 428

I think I might have found the answer. on my mac I have the following version of sqlite

SQLite version 3.19.3 2017-06-27 16:48:08

    sqlite> explain query plan
   ...> select distinct t1.ref, t1.alt from (SELECT * from Sample_szes where str_id = 'STR_832206') as t1;
2|0|1|SEARCH TABLE vcfhomozyg AS hzyg USING INDEX homozyg_strid_idx (str_id=?)
2|1|0|SEARCH TABLE vcfBase AS base USING INDEX vcfBase_strid_idx (str_id=?)
3|0|1|SEARCH TABLE vcfAlt AS alt USING INDEX vcfAlt_strid_idx (str_id=?)
3|1|0|SEARCH TABLE vcfBase AS base USING INDEX vcfBase_strid_idx (str_id=?)
3|2|2|SEARCH TABLE altGT AS gt USING INDEX altGT_strid_idx (str_id=?)
1|0|0|COMPOUND SUBQUERIES 2 AND 3 (UNION ALL)
0|0|0|SCAN SUBQUERY 1
0|0|0|USE TEMP B-TREE FOR DISTINCT

Upvotes: 0

Related Questions