Reputation: 2592
I'm optimizing a large SQLite sensor database for improving query performance.
For this I'm storing multiple resolution of average datas (as the goal is always getting the average of them).
So I have this as data:
CREATE TABLE "raw_data" (
"id" INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT,
"date" integer NOT NULL,
"measurement" integer NOT NULL
);
CREATE INDEX "idx_date_raw_data"
ON "raw_data" (
"date" ASC
);
INSERT INTO "raw_data" (id, date, measurement) VALUES
(0, 1546300868,2170),
(1, 1546301078,2160),
(2, 1546301108,2170),
(3, 1546301198,2160),
(4, 1546301498,2150),
(5, 1546301588,2160),
(6, 1546301648,2150),
(7, 1546301707,2140),
(8, 1546301738,2150),
(9, 1546301797,2140);
CREATE TABLE "ten_avg_data" (
"id" INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT,
"date" integer NOT NULL,
"measurement" integer NOT NULL
);
CREATE INDEX "idx_date_ten_avg_data"
ON "ten_avg_data" (
"date" ASC
);
INSERT INTO "ten_avg_data" (id, date, measurement) VALUES
(0, 1546300868,2155);
And I want to query the average value:
SELECT
COUNT(*)
FROM
raw_data
WHERE
date BETWEEN 1546300868 AND 1546302037;
-- If above count is < 10 do this:
SELECT
AVG(measurement)
FROM
raw_data
WHERE
date BETWEEN 1546300868 AND 1546302037;
-- but if >=10, do this:
SELECT
AVG(measurement)
FROM
ten_avg_data
WHERE
date BETWEEN 1546300868 AND 1546302037
As you can see, SELECT AVG(measurement)
is common for the two queries, I just want to make the from clause conditional.
As raw_data
data is huge, it's not good to join raw_data
and ten_avg_data
together, because that'd be slow.
So please consider that querying the average from raw_data
takes 1 minute, and getting the same from ten_avg_data
is 0.5s.
In this case what is the best approach to make the FROM clause conditional?
Playground is here.
Upvotes: 1
Views: 103
Reputation: 1269753
I think the best you can do is something like this:
WITH rd AS (
SELECT COUNT(*) as cnt, AVG(measurement) as avg_m
FROM raw_data
WHERE date BETWEEN 1546300868 AND 1546302037;
)
SELECT avg_m
FROM rd
WHERE cnt < 10
UNION ALL
SELECT AVG(measurement)
FROM ten_avg_data
WHERE date BETWEEN 1546300868 AND 1546302037 AND
(SELECT cnt FROM rd) >= 10;
The cost of the COUNT()
and AVG()
is going to be very similar -- the expense is processing the data.
Over a minute for the first query seems way too long. You should consider an index on raw_data(date, measurement)
.
Upvotes: 0
Reputation: 37472
Well, one way to do it is to use a HAVING
clause in the query over all_data
that checks for the count(*)
being greater than 10
. If it isn't this query will yield the empty set. So you can use it as a subquery with NOT EXISTS
in the WHERE
clause of the second query making that second query yield the empty set, when the first one doesn't. Combine the results with UNION ALL
and you'll get the result of the one not yielding the empty set.
SELECT avg(rd1.measurement)
FROM raw_data rd1
WHERE rd1.date BETWEEN 1546300868
AND 1546302037
GROUP BY ''
HAVING count(*) < 10
UNION ALL
SELECT avg(tad1.measurement)
FROM ten_avg_data tad1
WHERE tad1.date BETWEEN 1546300868
AND 1546302037
AND NOT EXISTS (SELECT avg(rd1.measurement)
FROM raw_data rd1
WHERE rd1.date BETWEEN 1546300868
AND 1546302037
GROUP BY ''
HAVING count(*) < 10);
(Note that for some reason SQLite needs a GROUP BY
clause in order to use a HAVING
clause, we need to GROUP BY
any odd constant as we actually one want one group (over the whole table).)
I'm not that sure however, if your approach really helps to boost performance. I don't know how good SQLite is in leveraging statistics to determine the count of rows will be greater than or equal to ten for the first query. Or if it at least will "stop" the first query when getting ten rows. If it cannot determine that right away, it needs to scan the relevant part of the index idx_date_raw_data
anyway and you didn't gain anything (or at least not much) as it needed to do exactly that with only one query getting the average from all_data
...
Upvotes: 1