John Lotacs
John Lotacs

Reputation: 1254

Slow SQL Query on indexed columns, multi-column number

I have a table that is comprised of 6 numbers as the primary key

CREATE TABLE table1 ( num1 decimal, num2 int, num3 int, num4 bigint, num5 bigint, num6 bigint,
PRIMARY KEY (num1, num2, num3, num4, num5, num6))

I need to access the table in sorted order and often times I have a need to query the table to find the next N large numbers in order and their associated data.

So the query I wrote was something like this

SELECT * FROM table1 WHERE  
 num1 >? OR (  
 (num1 == ? AND num2 > ?) OR (  
 (num1 == ? AND num2 == ? AND num3 > ?) OR (  
 (num1 == ? AND num2 == ? AND num3 == ? AND num4 > ? OR (  
 (num1 == ? AND num2 == ? AND num3 == ? AND num4 == ? AND num5 > ?) OR (  
 (num1 == ? AND num2 == ? AND num3 == ?  
 AND num4 == ? AND num5 == ? AND num6 > ?)))))) ORDER BY num1, num2, num3, num4, num5, num6  
 LIMIT ?;

This was the best way I could see to find the next largest key, and this does query in the order of the index however....query takes a few seconds, which is something that I'm not to fond of.

Is there any way to improve the performance? This takes a few seconds to execute on a table of 10million rows and I need it to execute more on the order of 100ms.

Query Plan:

"SEARCH TABLE table1 USING INDEX sqlite_autoindex_table1_1 (num1>?) (~250000 rows)"
"SEARCH TABLE table1 USING INDEX sqlite_autoindex_table1_1 (num1=? AND num2>?) (~2 rows)"
"SEARCH TABLE table1 USING INDEX sqlite_autoindex_table1_1 (num1=? AND num2=? AND num3>?) (~2 rows)"
"SEARCH TABLE table1 USING INDEX sqlite_autoindex_table1_1 (num1=? AND num2=? AND num3=? AND num4>?) (~2 rows)"
"SEARCH TABLE table1 USING INDEX sqlite_autoindex_table1_1 (num1=? AND num2=? AND num3=? AND num4=? AND num5>?) (~1 rows)"
"SEARCH TABLE table1 USING INDEX sqlite_autoindex_table1_1 (num1=? AND num2=? AND num3=? AND num4=? AND num5=? AND num6>?) (~1 rows)"
"USE TEMP B-TREE FOR ORDER BY"

Edit:

Why is this not possible? I literally want to get things in the INDEXED ORDER, the same order generated by the ORDER BY keyword?

Upvotes: 1

Views: 652

Answers (3)

Didier Spezia
Didier Spezia

Reputation: 73226

Contrary to other more sophisticated RDBMS, sqlite has a rule-based query optimizer, meaning that the execution plan mostly depends on the way the query is written (and the order of the clauses). It makes the optimizer quite predictable, and if you know how sqlite generates execution plans, you can take benefit of this predictability to solve your issue.

A first idea is to note that the various clauses like (num1>?) or (num1=? and num2>?) are producing disjoint results, and that these results are naturally sorted between each others. If the query is divided in subqueries (each of them handling a part of the condition) producing sorted results, then the concatenation of all the result-sets is also sorted, if the subqueries are executed in the correct order.

For example, consider the following queries:

select * from table1 where num1=? and num2>? order by num1,num2
select * from table1 where num1>? order by num1,num2

The two result sets produced by these queries are disjoint, and the rows of the first result-set are always ordered before the rows of the second result-set.

The second idea is to understand how sqlite handles the LIMIT clause. Actually, it declares a counter at the begining of the query, and decrement and test this counter at each selected row, so it can stop a query early.

For instance, consider the following query:

.explain
explain select * from (
   select * from table1 where num1=? and num2>?
   union all
   select * from table1 where num1>?
) limit 10;

sqlite will evaluate the subqueries in the order specified in the query. If the first subquery returns more than 10 rows, the second subquery will not even be executed. It can be easily checked by displaying the plan:

addr  opcode         p1    p2    p3    p4             p5  comment      
----  -------------  ----  ----  ----  -------------  --  -------------
0     Trace          0     0     0                    00               
1     Integer        10    1     0                    00               
2     Variable       1     2     2                    00               
3     Goto           0     44    0                    00               
4     OpenRead       3     3     0     keyinfo(6,BINARY,BINARY)  00               
5     SCopy          2     4     0                    00               
6     IsNull         4     23    0                    00               
7     SCopy          3     5     0                    00               
8     IsNull         5     23    0                    00               
9     Affinity       4     2     0     cd             00               
10    SeekGt         3     23    4     2              00               
11    IdxGE          3     23    4     1              01               
12    Column         3     1     6                    00               
13    IsNull         6     22    0                    00               
14    Column         3     0     7                    00               
15    Column         3     1     8                    00               
16    Column         3     2     9                    00               
17    Column         3     3     10                   00               
18    Column         3     4     11                   00               
19    Column         3     5     12                   00               
20    ResultRow      7     6     0                    00               
21    IfZero         1     23    -1                   00               
22    Next           3     11    0                    00               
23    Close          3     0     0                    00               
24    IfZero         1     43    0                    00               
25    Variable       3     13    1                    00               
26    OpenRead       4     3     0     keyinfo(6,BINARY,BINARY)  00               
27    SCopy          13    14    0                    00               
28    IsNull         14    42    0                    00               
29    Affinity       14    1     0     c              00               
30    SeekGt         4     42    14    1              00               
31    Column         4     0     6                    00               
32    IsNull         6     41    0                    00               
33    Column         4     0     7                    00               
34    Column         4     1     8                    00               
35    Column         4     2     9                    00               
36    Column         4     3     10                   00               
37    Column         4     4     11                   00               
38    Column         4     5     12                   00               
39    ResultRow      7     6     0                    00               
40    IfZero         1     42    -1                   00               
41    Next           4     31    0                    00               
42    Close          4     0     0                    00               
43    Halt           0     0     0                    00               
44    Transaction    0     0     0                    00               
45    VerifyCookie   0     3     0                    00               
46    TableLock      0     2     0     table1         00               
47    Goto           0     4     0                    00               

The counter is declared step 1, and decremented/tested at steps 21, 24, 40.

By combining these two remarks, we can propose a query which is not pretty, but will produce an efficient execution plan:

SELECT * FROM (
  SELECT * FROM ( SELECT * FROM table1
                  WHERE num1 == ? AND num2 == ? AND num3 == ? AND num4 == ? AND num5 == ? AND num6 > ?
                  ORDER BY num1, num2, num3, num4, num5, num6 )
  UNION ALL
  SELECT * FROM ( SELECT * FROM table1
                  WHERE num1 == ? AND num2 == ? AND num3 == ? AND num4 == ? AND num5 > ?
                  ORDER BY num1, num2, num3, num4, num5, num6 )
  UNION ALL
  SELECT * FROM ( SELECT * FROM table1
                  WHERE num1 == ? AND num2 == ? AND num3 == ? AND num4 > ?
                  ORDER BY num1, num2, num3, num4, num5, num6 )
  UNION ALL
  SELECT * FROM ( SELECT * FROM table1
                  WHERE num1 == ? AND num2 == ? AND num3 > ?
                  ORDER BY num1, num2, num3, num4, num5, num6 )
  UNION ALL
  SELECT * FROM ( SELECT * FROM table1
                  WHERE num1 == ? AND num2 > ?
                  ORDER BY num1, num2, num3, num4, num5, num6 )
  UNION ALL
  SELECT * FROM ( SELECT * FROM table1
                  WHERE num1 > ?
                  ORDER BY num1, num2, num3, num4, num5, num6 )
) LIMIT ?;

Note that because the "order by" clause is not required in the outer query, there is no need for sqlite to execute all the subqueries. So it can just stop when it has the correct number of rows. The order of the subqueries is critical.

The second level inner subqueries are needed because it is not possible to use "order by" before "union all". They are optimized away by sqlite, so it is not an issue.

On a dummy table containing 777K rows, the initial query costs:

strace -c -eread,lseek sqlite3 toto.db < q1.sql
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 63.57    0.001586           0     18556           read
 36.43    0.000909           0     18544           lseek
------ ----------- ----------- --------- --------- ----------------
100.00    0.002495                 37100           total

while mine only costs:

strace -c -eread,lseek sqlite3 toto.db < q3.sql
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
  -nan    0.000000           0        18           read
  -nan    0.000000           0         8           lseek
------ ----------- ----------- --------- --------- ----------------
100.00    0.000000                    26           total

Upvotes: 1

filimonov
filimonov

Reputation: 1734

I think with such cases query optimizer should deal, but in aqlite it's very simple, so it's really would be better to change table structure as @cyroxx wrote.

Other thoughts: you can also try to rewrite query in some other way, and may be optimizer will understand what is needed. For example you can try:

SELECT * FROM table1 WHERE
    num1 > 1   OR
  ( num1 = 1   AND ( num2 > 2 OR
                   ( num2 = 2 AND ( num3 > 3 OR
                                  ( num3 = 3 AND ( num4 > 4 OR
                                                 ( num4 = 4 AND ( num5 > 5 OR
                                                                ( num5 = 5 AND num6 > 6)
                                                                )
                                                 )
                                                 )
                                  )
                                  )
                   )
                   )
  )

May be it will became better (or worse :) ).

Upvotes: 1

cyroxx
cyroxx

Reputation: 3877

If this is a valid solution to your problem: You should really consider to use a single surrogate key instead of a six-part natural key.

As you as can see from your own example, it is overly complicated to just do a lookup based on the primary key. Instead of taking only one column into account, multiple index lookups need to be performed. Each index lookup involves disk latency which in your case easily dominates the overall processing time.

Just see the query plan you posted, after the first lookup the number of rows returned is already down to two and querying the remaining indexes is costly compared to the number of rows that can be left out after each step (0-1 row).

Thus, if you only need to query a single column of type integer as the primary key, you should encounter some significant performance gain, as you can see from the SQLite docs:

The data for each table in SQLite is stored as a B-Tree structure containing an entry for each table row, using the rowid value as the key. This means that retrieving or sorting records by rowid is fast. Searching for a record with a specific rowid, or for all records with rowids within a specified range is around twice as fast as a similar search made by specifying any other PRIMARY KEY or indexed value.

With one exception, if a table has a primary key that consists of a single column, and the declared type of that column is "INTEGER" in any mixture of upper and lower case, then the column becomes an alias for the rowid. Such a column is usually referred to as an "integer primary key".

Moreover, your SQL queries will be simpler and can be maintained more easily.

Upvotes: 1

Related Questions