punkish
punkish

Reputation: 15248

Speeding up COUNT DISTINCT queries in SQLite

Update: added background info and more explanation

Given ~300K records in the following table

CREATE TABLE t (
    id INTEGER PRIMARY KEY, 
    a TEXT, 
    b TEXT, 
    c TEXT, 
    deleted INTEGER DEFAULT 0
);

CREATE INDEX ix_t ON t (deleted) WHERE deleted = 0;

1) SELECT Count(id) FROM t WHERE deleted = 0;
2) SELECT Count(DISTINCT a) FROM t WHERE deleted = 0;
3) SELECT Count(DISTINCT b) FROM t WHERE deleted = 0;
4) SELECT Count(DISTINCT c) FROM t WHERE deleted = 0;

Query 1 takes 4-5 ms. The other three queries COUNT (DISTINCT <col>) take 600-900ms. How can I speed up these queries to the same order as the first one? I created following indexes but it didn't help

CREATE INDEX ix_t_a ON t (a, deleted) WHERE deleted = 0;
-- (and so on the columns b and c as well)

EXPLAIN QUERY PLAN SELECT Count(DISTINCT a) FROM t WHERE deleted = 0;

-- output: SEARCH TABLE t USING INDEX ix_t (deleted=?)

As we can see above, the new index is not being used.

The deleted column is a flag to keep track of rows that are to be excluded from all queries. For the most part, such rows will be very few, but it is essential that they are not used in the counts and selects.

Eventually the number of rows might grow to 3 times as much, let's say 1M. Even then, the number of deleted rows will be minimal.

The COUNT (DISTINCT column) counts are to create facets. Let's say someone searches for all rows WHERE a = 'foo'. I need to return the matching rows, and I also need to return how many DISTINCT b and c are present in those rows. So, I will do something like

-- number of rows in the result set
SELECT COUNT(id) FROM t WHERE deleted = 0 AND a = 'foo';

-- the actual result set
SELECT * FROM t WHERE deleted = 0 AND a = 'foo' LIMIT 30 OFFSET 0;

-- facet counts
SELECT COUNT (DISTINCT b) FROM t WHERE deleted = 0 AND a = 'foo';
SELECT COUNT (DISTINCT c) FROM t WHERE deleted = 0 AND a = 'foo';

In the very first instance, since there is no WHERE clause, the above queries will be

-- number of rows in the result set
SELECT COUNT(id) FROM t WHERE deleted = 0;

-- the actual result set
SELECT * FROM t WHERE deleted = 0 LIMIT 30 OFFSET 0;

-- facet counts
SELECT COUNT (DISTINCT a) FROM t WHERE deleted = 0;
SELECT COUNT (DISTINCT b) FROM t WHERE deleted = 0;
SELECT COUNT (DISTINCT c) FROM t WHERE deleted = 0;

The actual columns in the table that make up the facets are about 8 or 9. So, I have to do about 8 or 9 SELECT COUNT (DISTINCT col). With each taking about 600-900ms, that is almost 6-10 seconds. Way too slow for a query from the point-of-view of the user. Reducing the selects by half or ⅔ makes a huge difference.

In reality, I also have a query cache, so the results of any query performed once will be cached, and will be very fast the second time, as long as it is exactly the same query. Here, by query, I mean a query from the pov of the user. Of course, each query from the pov of the user results in 9-10 database queries. Nevertheless, speed is very important to create a performant application.

Upvotes: 0

Views: 808

Answers (1)

MikeT
MikeT

Reputation: 56948

How can I speed up these queries to the same order as the first one?

Part 1

I'm not sure if you can that much and if so it may well depend upon the data itself e.g. if analyze does some amazing things.

DISTINCT has quite some impact. So that may be the sole reason for the discrepancy

If you just use EXPLAIN for the first query without DISTINCT the result is:-

enter image description here

For the other queries the EXPLAIN result is :-

enter image description here

How can I speed up these queries to the same order as the first one?

Part 2

Perhaps consider adapting the following code that was used to have a little play and conclude that ANALYZE may not improve matters and nor does playing around (a little bit) with the indexes appear to make much difference :-

DROP TABLE IF EXISTS t;
CREATE TABLE t (id INTEGER PRIMARY KEY, a TEXT, b TEXT, c TEXT, deleted INTEGER DEFAULT 0);

WITH 
    alphabet(letters) AS (SELECT 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'),
    cte1(counter,deleted) AS (SELECT 1, abs(random()) % 2  UNION ALL SELECT counter+1, abs(random()) % 2 FROM cte1 LIMIT 300000)
INSERT INTO t (deleted,a,b,c) 
    SELECT deleted,
        substr((SELECT letters FROM alphabet),(abs(random()) % ((SELECT max(length(letters)) FROM alphabet)-10)) + 1,(abs(random()) % 10) + 1),
        substr((SELECT letters FROM alphabet),(abs(random()) % ((SELECT max(length(letters)) FROM alphabet)-10)) + 1,(abs(random()) % 10) + 1),
        substr((SELECT letters FROM alphabet),(abs(random()) % ((SELECT max(length(letters)) FROM alphabet)-10)) + 1,(abs(random()) % 10) + 1)
    FROM cte1;

SELECT * FROM t; /* COMMENT/UNCOMMENT WITH/WITHOUT -- To HIDE/SHOW TABLE */ 


CREATE INDEX ix_t ON t (deleted) WHERE deleted = 0;

-- EXPLAIN  /* UNCOMMENT BY REMOVING -- ON THIS LINE TO JUST DO EXPLAIN */
-- QUERY PLAN /* TO DO EXPLAIN QUERY PLAN UNCOMMNET LINE ABOVE AS ABOVE and UNCOMMENT THIS SAME WAY */
SELECT Count(id) FROM t WHERE deleted = 0;
-- EXPLAIN 
-- QUERY PLAN
SELECT Count(DISTINCT a) FROM t WHERE deleted = 0;
-- EXPLAIN 
-- QUERY PLAN
SELECT Count(DISTINCT b) FROM t WHERE deleted = 0;
-- EXPLAIN 
-- QUERY PLAN
SELECT Count(DISTINCT c) FROM t WHERE deleted = 0;
ANALYZE;
SELECT Count(id) FROM t WHERE deleted = 0;
-- EXPLAIN 
-- QUERY PLAN
SELECT Count(DISTINCT a) FROM t WHERE deleted = 0;
-- EXPLAIN 
-- QUERY PLAN
SELECT Count(DISTINCT b) FROM t WHERE deleted = 0;
-- EXPLAIN 
-- QUERY PLAN
SELECT Count(DISTINCT c) FROM t WHERE deleted = 0;

/*<<<<<<<<<< ATTEMPT 2 >>>>>>>>>>*/
DROP TABLE IF EXISTS t;
CREATE TABLE t (id INTEGER PRIMARY KEY, a TEXT, b TEXT, c TEXT, deleted INTEGER DEFAULT 0);

WITH 
    alphabet(letters) AS (SELECT 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'),
    cte1(counter,deleted) AS (SELECT 1, abs(random()) % 2  UNION ALL SELECT counter+1, abs(random()) % 2 FROM cte1 LIMIT 300000)
INSERT INTO t (deleted,a,b,c) 
    SELECT deleted,
        substr((SELECT letters FROM alphabet),(abs(random()) % ((SELECT max(length(letters)) FROM alphabet)-10)) + 1,(abs(random()) % 10) + 1),
        substr((SELECT letters FROM alphabet),(abs(random()) % ((SELECT max(length(letters)) FROM alphabet)-10)) + 1,(abs(random()) % 10) + 1),
        substr((SELECT letters FROM alphabet),(abs(random()) % ((SELECT max(length(letters)) FROM alphabet)-10)) + 1,(abs(random()) % 10) + 1)
    FROM cte1;

-- SELECT * FROM t;


CREATE INDEX ix_t ON t (deleted,a,b,c) WHERE deleted = 0;

-- EXPLAIN 
-- QUERY PLAN
SELECT Count(id) FROM t WHERE deleted = 0;
-- EXPLAIN 
-- QUERY PLAN
SELECT Count(DISTINCT a) FROM t WHERE deleted = 0;
-- EXPLAIN 
-- QUERY PLAN
SELECT Count(DISTINCT b) FROM t WHERE deleted = 0;
-- EXPLAIN 
-- QUERY PLAN
SELECT Count(DISTINCT c) FROM t WHERE deleted = 0;
ANALYZE;
SELECT Count(id) FROM t WHERE deleted = 0;
-- EXPLAIN 
-- QUERY PLAN
SELECT Count(DISTINCT a) FROM t WHERE deleted = 0;
-- EXPLAIN 
-- QUERY PLAN
SELECT Count(DISTINCT b) FROM t WHERE deleted = 0;
-- EXPLAIN 
-- QUERY PLAN
SELECT Count(DISTINCT c) FROM t WHERE deleted = 0;

This when run (assuming the tool used (ok in Navicat) allows a series of queries). Will :-

  1. Drops and Creates the table t
  2. Populate t with random data (except rowids)
  3. Create the index (see code as 2nd part is different asis)
  4. Runs the 4 queries (see comments re turning explain or explain query plan on or off)
  5. Does an analzye.
  6. Runs the same 4 queries post analyze.
  7. Repeats 1-6 (with different index).

Example log :-

DROP TABLE IF EXISTS t
> OK
> Time: 1.229s


CREATE TABLE t (id INTEGER PRIMARY KEY, a TEXT, b TEXT, c TEXT, deleted INTEGER DEFAULT 0)
> OK
> Time: 0.132s


WITH 
    alphabet(letters) AS (SELECT 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'),
    cte1(counter,deleted) AS (SELECT 1, abs(random()) % 2  UNION ALL SELECT counter+1, abs(random()) % 2 FROM cte1 LIMIT 300000)
INSERT INTO t (deleted,a,b,c) 
    SELECT deleted,
        substr((SELECT letters FROM alphabet),(abs(random()) % ((SELECT max(length(letters)) FROM alphabet)-10)) + 1,(abs(random()) % 10) + 1),
        substr((SELECT letters FROM alphabet),(abs(random()) % ((SELECT max(length(letters)) FROM alphabet)-10)) + 1,(abs(random()) % 10) + 1),
        substr((SELECT letters FROM alphabet),(abs(random()) % ((SELECT max(length(letters)) FROM alphabet)-10)) + 1,(abs(random()) % 10) + 1)
    FROM cte1
> Affected rows: 300000
> Time: 0.75s


-- SELECT * FROM t;


CREATE INDEX ix_t ON t (deleted) WHERE deleted = 0
> OK
> Time: 0.195s


EXPLAIN 
-- QUERY PLAN
SELECT Count(id) FROM t WHERE deleted = 0
> OK
> Time: 0s


EXPLAIN 
-- QUERY PLAN
SELECT Count(DISTINCT a) FROM t WHERE deleted = 0
> OK
> Time: 0s


EXPLAIN 
-- QUERY PLAN
SELECT Count(DISTINCT b) FROM t WHERE deleted = 0
> OK
> Time: 0s


EXPLAIN 
-- QUERY PLAN
SELECT Count(DISTINCT c) FROM t WHERE deleted = 0
> OK
> Time: 0s


ANALYZE
> OK
> Time: 0.137s


SELECT Count(id) FROM t WHERE deleted = 0
> OK
> Time: 0.031s


-- EXPLAIN 
-- QUERY PLAN
SELECT Count(DISTINCT a) FROM t WHERE deleted = 0
> OK
> Time: 0.057s


-- EXPLAIN 
-- QUERY PLAN
SELECT Count(DISTINCT b) FROM t WHERE deleted = 0
> OK
> Time: 0.054s


-- EXPLAIN 
-- QUERY PLAN
SELECT Count(DISTINCT c) FROM t WHERE deleted = 0
> OK
> Time: 0.055s


/*<<<<<<<<<< ATTEMPT 2 >>>>>>>>>>*/
DROP TABLE IF EXISTS t
> OK
> Time: 0.891s


CREATE TABLE t (id INTEGER PRIMARY KEY, a TEXT, b TEXT, c TEXT, deleted INTEGER DEFAULT 0)
> OK
> Time: 0.153s


WITH 
    alphabet(letters) AS (SELECT 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'),
    cte1(counter,deleted) AS (SELECT 1, abs(random()) % 2  UNION ALL SELECT counter+1, abs(random()) % 2 FROM cte1 LIMIT 300000)
INSERT INTO t (deleted,a,b,c) 
    SELECT deleted,
        substr((SELECT letters FROM alphabet),(abs(random()) % ((SELECT max(length(letters)) FROM alphabet)-10)) + 1,(abs(random()) % 10) + 1),
        substr((SELECT letters FROM alphabet),(abs(random()) % ((SELECT max(length(letters)) FROM alphabet)-10)) + 1,(abs(random()) % 10) + 1),
        substr((SELECT letters FROM alphabet),(abs(random()) % ((SELECT max(length(letters)) FROM alphabet)-10)) + 1,(abs(random()) % 10) + 1)
    FROM cte1
> Affected rows: 300000
> Time: 0.643s


-- SELECT * FROM t;


CREATE INDEX ix_t ON t (deleted,a,b,c) WHERE deleted = 0
> OK
> Time: 0.583s


-- EXPLAIN 
-- QUERY PLAN
SELECT Count(id) FROM t WHERE deleted = 0
> OK
> Time: 0.029s


-- EXPLAIN 
-- QUERY PLAN
SELECT Count(DISTINCT a) FROM t WHERE deleted = 0
> OK
> Time: 0.041s


-- EXPLAIN 
-- QUERY PLAN
SELECT Count(DISTINCT b) FROM t WHERE deleted = 0
> OK
> Time: 0.029s


-- EXPLAIN 
-- QUERY PLAN
SELECT Count(DISTINCT c) FROM t WHERE deleted = 0
> OK
> Time: 0.031s


ANALYZE
> OK
> Time: 0.121s


SELECT Count(id) FROM t WHERE deleted = 0
> OK
> Time: 0.038s


-- EXPLAIN 
-- QUERY PLAN
SELECT Count(DISTINCT a) FROM t WHERE deleted = 0
> OK
> Time: 0.046s


-- EXPLAIN 
-- QUERY PLAN
SELECT Count(DISTINCT b) FROM t WHERE deleted = 0
> OK
> Time: 0.029s


-- EXPLAIN 
-- QUERY PLAN
SELECT Count(DISTINCT c) FROM t WHERE deleted = 0
> OK
> Time: 0.031s

Part 3

I haven't done much working out with timings but perhaps consider this (using a tempt table with only deleted = 0 rows) :-

DROP TABLE IF EXISTS t;
CREATE TABLE t (id INTEGER PRIMARY KEY, a TEXT, b TEXT, c TEXT, deleted INTEGER DEFAULT 0);

WITH 
    alphabet(letters) AS (SELECT 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'),
    cte1(counter,deleted) AS (SELECT 1, abs(random()) % 2  UNION ALL SELECT counter+1, abs(random()) % 2 FROM cte1 LIMIT 300000)
INSERT INTO t (deleted,a,b,c) 
    SELECT deleted,
        substr((SELECT letters FROM alphabet),(abs(random()) % ((SELECT max(length(letters)) FROM alphabet)-10)) + 1,(abs(random()) % 10) + 1),
        substr((SELECT letters FROM alphabet),(abs(random()) % ((SELECT max(length(letters)) FROM alphabet)-10)) + 1,(abs(random()) % 10) + 1),
        substr((SELECT letters FROM alphabet),(abs(random()) % ((SELECT max(length(letters)) FROM alphabet)-10)) + 1,(abs(random()) % 10) + 1)
    FROM cte1;

-- SELECT * FROM t; /* COMMENT/UNCOMMENT WITH/WITHOUT -- To HIDE/SHOW TABLE */ 

CREATE TEMP TABLE trimmedt AS SELECT * FROM t WHERE deleted = 0;


-- EXPLAIN  /* UNCOMMENT BY REMOVING -- ON THIS LINE TO JUST DO EXPLAIN */
-- QUERY PLAN /* TO DO EXPLAIN QUERY PLAN UNCOMMNET LINE ABOVE AS ABOVE and UNCOMMENT THIS SAME WAY */
SELECT Count(id) FROM trimmedt;
-- EXPLAIN 
-- QUERY PLAN
SELECT Count(DISTINCT a) FROM trimmedt;
-- EXPLAIN 
-- QUERY PLAN
SELECT Count(DISTINCT b) FROM trimmedt;
-- EXPLAIN 
-- QUERY PLAN
SELECT Count(DISTINCT c) FROM trimmedt;

Message Log :-

DROP TABLE IF EXISTS t
> OK
> Time: 1.242s


CREATE TABLE t (id INTEGER PRIMARY KEY, a TEXT, b TEXT, c TEXT, deleted INTEGER DEFAULT 0)
> OK
> Time: 0.103s


WITH 
    alphabet(letters) AS (SELECT 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'),
    cte1(counter,deleted) AS (SELECT 1, abs(random()) % 2  UNION ALL SELECT counter+1, abs(random()) % 2 FROM cte1 LIMIT 300000)
INSERT INTO t (deleted,a,b,c) 
    SELECT deleted,
        substr((SELECT letters FROM alphabet),(abs(random()) % ((SELECT max(length(letters)) FROM alphabet)-10)) + 1,(abs(random()) % 10) + 1),
        substr((SELECT letters FROM alphabet),(abs(random()) % ((SELECT max(length(letters)) FROM alphabet)-10)) + 1,(abs(random()) % 10) + 1),
        substr((SELECT letters FROM alphabet),(abs(random()) % ((SELECT max(length(letters)) FROM alphabet)-10)) + 1,(abs(random()) % 10) + 1)
    FROM cte1
> Affected rows: 300000
> Time: 0.722s


-- SELECT * FROM t; /* COMMENT/UNCOMMENT WITH/WITHOUT -- To HIDE/SHOW TABLE */ 

CREATE TEMP TABLE trimmedt AS SELECT * FROM t WHERE deleted = 0
> OK
> Time: 0.091s


-- EXPLAIN  /* UNCOMMENT BY REMOVING -- ON THIS LINE TO JUST DO EXPLAIN */
-- QUERY PLAN /* TO DO EXPLAIN QUERY PLAN UNCOMMNET LINE ABOVE AS ABOVE and UNCOMMENT THIS SAME WAY */
SELECT Count(id) FROM trimmedt
> OK
> Time: 0.009s


-- EXPLAIN 
-- QUERY PLAN
SELECT Count(DISTINCT a) FROM trimmedt
> OK
> Time: 0.03s


-- EXPLAIN 
-- QUERY PLAN
SELECT Count(DISTINCT b) FROM trimmedt
> OK
> Time: 0.03s


-- EXPLAIN 
-- QUERY PLAN
SELECT Count(DISTINCT c) FROM trimmedt
> OK
> Time: 0.031s

Upvotes: 2

Related Questions