Reputation: 1254
I have a table whose primary key is composed of 6 numbers
CREATE TABLE table1 (
    num1 decimal, num2 int, num3 int, num4 bigint, num5 bigint, num6 bigint,
    PRIMARY KEY (num1, num2, num3, num4, num5, num6)
);
I need to access the table in sorted order, and I often need to find the next N largest keys and their associated data.
So the query I wrote was something like this:
SELECT * FROM table1 WHERE
    num1 > ?
    OR (num1 == ? AND num2 > ?)
    OR (num1 == ? AND num2 == ? AND num3 > ?)
    OR (num1 == ? AND num2 == ? AND num3 == ? AND num4 > ?)
    OR (num1 == ? AND num2 == ? AND num3 == ? AND num4 == ? AND num5 > ?)
    OR (num1 == ? AND num2 == ? AND num3 == ? AND num4 == ? AND num5 == ? AND num6 > ?)
ORDER BY num1, num2, num3, num4, num5, num6
LIMIT ?;
This was the best way I could see to find the next largest keys, and it does return rows in index order. However, the query takes a few seconds to execute on a table of 10 million rows, which I'm not too fond of: I need it to run on the order of 100 ms. Is there any way to improve the performance?
Query Plan:
"SEARCH TABLE table1 USING INDEX sqlite_autoindex_table1_1 (num1>?) (~250000 rows)"
"SEARCH TABLE table1 USING INDEX sqlite_autoindex_table1_1 (num1=? AND num2>?) (~2 rows)"
"SEARCH TABLE table1 USING INDEX sqlite_autoindex_table1_1 (num1=? AND num2=? AND num3>?) (~2 rows)"
"SEARCH TABLE table1 USING INDEX sqlite_autoindex_table1_1 (num1=? AND num2=? AND num3=? AND num4>?) (~2 rows)"
"SEARCH TABLE table1 USING INDEX sqlite_autoindex_table1_1 (num1=? AND num2=? AND num3=? AND num4=? AND num5>?) (~1 rows)"
"SEARCH TABLE table1 USING INDEX sqlite_autoindex_table1_1 (num1=? AND num2=? AND num3=? AND num4=? AND num5=? AND num6>?) (~1 rows)"
"USE TEMP B-TREE FOR ORDER BY"
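For what it's worth, the predicate follows a regular pattern, so it can be generated for any number of key columns. A minimal sketch in Python (the helper name keyset_where and the in-memory test table are mine, not the real schema):

```python
import sqlite3

def keyset_where(cols):
    """Predicate for 'composite key strictly greater than the bound key'."""
    clauses = []
    for i, col in enumerate(cols):
        eqs = " AND ".join(f"{c} == ?" for c in cols[:i])
        clauses.append(f"({eqs} AND {col} > ?)" if eqs else f"({col} > ?)")
    return " OR ".join(clauses)

cols = ["num1", "num2", "num3"]
sql = (f"SELECT * FROM table1 WHERE {keyset_where(cols)} "
       f"ORDER BY {', '.join(cols)} LIMIT ?")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (num1, num2, num3, "
             "PRIMARY KEY (num1, num2, num3))")
conn.executemany("INSERT INTO table1 VALUES (?, ?, ?)",
                 [(a, b, c) for a in range(3) for b in range(3) for c in range(3)])

key, n = (1, 1, 1), 4
# One parameter group per OR-clause: key[:1], key[:2], key[:3], then the limit
params = [v for i in range(len(cols)) for v in key[:i + 1]] + [n]
rows = conn.execute(sql, params).fetchall()
```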
Edit:
Why is this not possible? I literally want to get rows in the INDEXED ORDER, i.e. the same order the ORDER BY clause produces.
Upvotes: 1
Views: 652
Reputation: 73226
Contrary to other, more sophisticated RDBMSs, sqlite has a rule-based query optimizer, meaning that the execution plan depends mostly on the way the query is written (and on the order of its clauses). This makes the optimizer quite predictable, and if you know how sqlite generates execution plans, you can take advantage of this predictability to solve your issue.
A first idea is to note that the various clauses like (num1>?) or (num1=? and num2>?) produce disjoint results, and that these results are naturally ordered relative to each other. If the query is divided into subqueries (each handling one part of the condition) that produce sorted results, then the concatenation of all the result sets is also sorted, provided the subqueries are executed in the correct order.
For example, consider the following queries:
select * from table1 where num1=? and num2>? order by num1,num2
select * from table1 where num1>? order by num1,num2
The two result sets produced by these queries are disjoint, and the rows of the first result set always order before the rows of the second.
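As a quick check, here is a minimal sketch in Python (throwaway in-memory data, not the original table) showing that concatenating the two result sets with UNION ALL yields the same rows as the single ordered query:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (num1, num2, PRIMARY KEY (num1, num2))")
conn.executemany("INSERT INTO table1 VALUES (?, ?)",
                 [(a, b) for a in range(3) for b in range(3)])

# Disjoint, internally sorted arms, concatenated in the right order
concat = conn.execute("""
    SELECT * FROM ( SELECT * FROM table1
        WHERE num1 = 1 AND num2 > 1
        ORDER BY num1, num2 )
    UNION ALL
    SELECT * FROM ( SELECT * FROM table1
        WHERE num1 > 1
        ORDER BY num1, num2 )
""").fetchall()

# Same rows as a single predicate with an explicit ORDER BY
ordered = conn.execute("""
    SELECT * FROM table1
    WHERE num1 > 1 OR (num1 = 1 AND num2 > 1)
    ORDER BY num1, num2
""").fetchall()
```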
The second idea is to understand how sqlite handles the LIMIT clause. It declares a counter at the beginning of the query, and decrements and tests this counter for each selected row, so it can stop a query early.
For instance, consider the following query:
.explain
explain select * from (
select * from table1 where num1=? and num2>?
union all
select * from table1 where num1>?
) limit 10;
sqlite evaluates the subqueries in the order specified in the query. If the first subquery returns at least 10 rows, the second subquery is not even executed. This can easily be checked by displaying the plan:
addr opcode p1 p2 p3 p4 p5 comment
---- ------------- ---- ---- ---- ------------- -- -------------
0 Trace 0 0 0 00
1 Integer 10 1 0 00
2 Variable 1 2 2 00
3 Goto 0 44 0 00
4 OpenRead 3 3 0 keyinfo(6,BINARY,BINARY) 00
5 SCopy 2 4 0 00
6 IsNull 4 23 0 00
7 SCopy 3 5 0 00
8 IsNull 5 23 0 00
9 Affinity 4 2 0 cd 00
10 SeekGt 3 23 4 2 00
11 IdxGE 3 23 4 1 01
12 Column 3 1 6 00
13 IsNull 6 22 0 00
14 Column 3 0 7 00
15 Column 3 1 8 00
16 Column 3 2 9 00
17 Column 3 3 10 00
18 Column 3 4 11 00
19 Column 3 5 12 00
20 ResultRow 7 6 0 00
21 IfZero 1 23 -1 00
22 Next 3 11 0 00
23 Close 3 0 0 00
24 IfZero 1 43 0 00
25 Variable 3 13 1 00
26 OpenRead 4 3 0 keyinfo(6,BINARY,BINARY) 00
27 SCopy 13 14 0 00
28 IsNull 14 42 0 00
29 Affinity 14 1 0 c 00
30 SeekGt 4 42 14 1 00
31 Column 4 0 6 00
32 IsNull 6 41 0 00
33 Column 4 0 7 00
34 Column 4 1 8 00
35 Column 4 2 9 00
36 Column 4 3 10 00
37 Column 4 4 11 00
38 Column 4 5 12 00
39 ResultRow 7 6 0 00
40 IfZero 1 42 -1 00
41 Next 4 31 0 00
42 Close 4 0 0 00
43 Halt 0 0 0 00
44 Transaction 0 0 0 00
45 VerifyCookie 0 3 0 00
46 TableLock 0 2 0 table1 00
47 Goto 0 4 0 00
The counter is declared at step 1, and decremented/tested at steps 21, 24, and 40.
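The early stop can also be observed from the application side. A sketch using a counting user-defined function (probe is a made-up name, and the table is throwaway); the counter should stay well below the 20 rows a full materialization would touch, since the second arm never runs to completion:

```python
import sqlite3

calls = {"n": 0}
def probe(x):
    # Counts how many rows actually reach the SELECT list
    calls["n"] += 1
    return x

conn = sqlite3.connect(":memory:")
conn.create_function("probe", 1, probe)
conn.execute("CREATE TABLE t (a INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(20)])

rows = conn.execute("""
    SELECT * FROM (
        SELECT probe(a) FROM t WHERE a < 10
        UNION ALL
        SELECT probe(a) FROM t WHERE a >= 10
    ) LIMIT 3
""").fetchall()
```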
By combining these two remarks, we can propose a query which is not pretty, but produces an efficient execution plan:
SELECT * FROM (
SELECT * FROM ( SELECT * FROM table1
WHERE num1 == ? AND num2 == ? AND num3 == ? AND num4 == ? AND num5 == ? AND num6 > ?
ORDER BY num1, num2, num3, num4, num5, num6 )
UNION ALL
SELECT * FROM ( SELECT * FROM table1
WHERE num1 == ? AND num2 == ? AND num3 == ? AND num4 == ? AND num5 > ?
ORDER BY num1, num2, num3, num4, num5, num6 )
UNION ALL
SELECT * FROM ( SELECT * FROM table1
WHERE num1 == ? AND num2 == ? AND num3 == ? AND num4 > ?
ORDER BY num1, num2, num3, num4, num5, num6 )
UNION ALL
SELECT * FROM ( SELECT * FROM table1
WHERE num1 == ? AND num2 == ? AND num3 > ?
ORDER BY num1, num2, num3, num4, num5, num6 )
UNION ALL
SELECT * FROM ( SELECT * FROM table1
WHERE num1 == ? AND num2 > ?
ORDER BY num1, num2, num3, num4, num5, num6 )
UNION ALL
SELECT * FROM ( SELECT * FROM table1
WHERE num1 > ?
ORDER BY num1, num2, num3, num4, num5, num6 )
) LIMIT ?;
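For completeness, here is the same pattern scaled down to a three-column key, as a runnable sketch against a throwaway in-memory table (the parameter list goes from the most specific arm to the least, then the limit):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (num1, num2, num3, "
             "PRIMARY KEY (num1, num2, num3))")
conn.executemany("INSERT INTO table1 VALUES (?, ?, ?)",
                 [(a, b, c) for a in range(3) for b in range(3) for c in range(3)])

sql = """
SELECT * FROM (
    SELECT * FROM ( SELECT * FROM table1
        WHERE num1 = ? AND num2 = ? AND num3 > ?
        ORDER BY num1, num2, num3 )
    UNION ALL
    SELECT * FROM ( SELECT * FROM table1
        WHERE num1 = ? AND num2 > ?
        ORDER BY num1, num2, num3 )
    UNION ALL
    SELECT * FROM ( SELECT * FROM table1
        WHERE num1 > ?
        ORDER BY num1, num2, num3 )
) LIMIT ?
"""
key, n = (1, 1, 1), 4
# Most specific arm binds the whole key, the next arm one prefix shorter, etc.
params = list(key) + list(key[:2]) + list(key[:1]) + [n]
rows = conn.execute(sql, params).fetchall()
```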
Note that because no ORDER BY clause is required in the outer query, sqlite does not need to execute all the subqueries: it can stop as soon as it has the requested number of rows. The order of the subqueries is critical.
The second-level inner subqueries are needed because it is not possible to use ORDER BY before UNION ALL. They are optimized away by sqlite, so they are not an issue.
On a dummy table containing 777K rows, the initial query costs:
strace -c -eread,lseek sqlite3 toto.db < q1.sql
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
63.57 0.001586 0 18556 read
36.43 0.000909 0 18544 lseek
------ ----------- ----------- --------- --------- ----------------
100.00 0.002495 37100 total
while mine only costs:
strace -c -eread,lseek sqlite3 toto.db < q3.sql
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
-nan 0.000000 0 18 read
-nan 0.000000 0 8 lseek
------ ----------- ----------- --------- --------- ----------------
100.00 0.000000 26 total
Upvotes: 1
Reputation: 1734
I think the query optimizer should deal with such cases, but in sqlite it is very simple, so it really would be better to change the table structure as @cyroxx wrote.
Other thoughts: you can also try rewriting the query in some other way, and maybe the optimizer will understand what is needed. For example, you can try:
SELECT * FROM table1 WHERE
    num1 > 1 OR
    ( num1 = 1 AND ( num2 > 2 OR
      ( num2 = 2 AND ( num3 > 3 OR
        ( num3 = 3 AND ( num4 > 4 OR
          ( num4 = 4 AND ( num5 > 5 OR
            ( num5 = 5 AND num6 > 6 )))))))))
Maybe it will get better (or worse :) ).
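For what it's worth, a quick sanity check in Python that this nested form selects exactly the rows whose key tuple compares greater than the bound key (throwaway in-memory data):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (num1, num2, num3, "
             "PRIMARY KEY (num1, num2, num3))")
data = [(a, b, c) for a in range(4) for b in range(4) for c in range(4)]
conn.executemany("INSERT INTO table1 VALUES (?, ?, ?)", data)

key = (1, 2, 3)
rows = conn.execute("""
    SELECT * FROM table1 WHERE
        num1 > ? OR
        ( num1 = ? AND ( num2 > ? OR
          ( num2 = ? AND num3 > ? ) ) )
    ORDER BY num1, num2, num3
""", (key[0], key[0], key[1], key[1], key[2])).fetchall()

# Python's tuple comparison is exactly the composite-key order
expected = sorted(t for t in data if t > key)
```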
Upvotes: 1
Reputation: 3877
If this is a valid solution to your problem: you should really consider using a single surrogate key instead of the six-part natural key.
As you can see from your own example, it is overly complicated just to do a lookup based on the primary key. Instead of taking only one column into account, multiple index lookups need to be performed. Each index lookup involves disk latency, which in your case easily dominates the overall processing time.
Just look at the query plan you posted: after the first lookup, the number of rows returned is already down to two, and probing the remaining index columns is costly compared to the number of rows that can be ruled out at each step (0-1 rows).
Thus, if you only need to query a single column of type INTEGER as the primary key, you should see a significant performance gain, as the SQLite docs explain:
The data for each table in SQLite is stored as a B-Tree structure containing an entry for each table row, using the rowid value as the key. This means that retrieving or sorting records by rowid is fast. Searching for a record with a specific rowid, or for all records with rowids within a specified range is around twice as fast as a similar search made by specifying any other PRIMARY KEY or indexed value.
With one exception, if a table has a primary key that consists of a single column, and the declared type of that column is "INTEGER" in any mixture of upper and lower case, then the column becomes an alias for the rowid. Such a column is usually referred to as an "integer primary key".
Moreover, your SQL queries will be simpler and can be maintained more easily.
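A small sketch of the rowid aliasing the docs describe (table and column names are made up): an INTEGER PRIMARY KEY column is the rowid, so a surrogate key rides directly on the table's own b-tree.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# "id" becomes an alias for the rowid because it is INTEGER PRIMARY KEY
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, payload TEXT)")
conn.execute("INSERT INTO t (payload) VALUES ('x')")

# Selecting both names returns the same underlying key value
rows = conn.execute("SELECT id, rowid FROM t").fetchall()
```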
Upvotes: 1