Daniel
Daniel

Reputation: 2592

Conditional FROM clause in SQLite

I'm optimizing a large SQLite sensor database for improving query performance.

For this I'm storing multiple resolution of average datas (as the goal is always getting the average of them).

So I have this as data:

CREATE TABLE "raw_data" (
  "id" INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT,
  "date" integer NOT NULL,
  "measurement" integer NOT NULL
);

CREATE INDEX "idx_date_raw_data"
ON "raw_data" (
  "date" ASC
);

INSERT INTO "raw_data" (id, date, measurement) VALUES
(0, 1546300868,2170),
(1, 1546301078,2160),
(2, 1546301108,2170),
(3, 1546301198,2160),
(4, 1546301498,2150),
(5, 1546301588,2160),
(6, 1546301648,2150),
(7, 1546301707,2140),
(8, 1546301738,2150),
(9, 1546301797,2140);

CREATE TABLE "ten_avg_data" (
  "id" INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT,
  "date" integer NOT NULL,
  "measurement" integer NOT NULL
);

CREATE INDEX "idx_date_ten_avg_data"
ON "ten_avg_data" (
  "date" ASC
);

INSERT INTO "ten_avg_data" (id, date, measurement) VALUES
(0, 1546300868,2155);

And I want to query the average value:

SELECT
  COUNT(*)
FROM
  raw_data
WHERE
  date BETWEEN 1546300868 AND 1546302037;

-- If above count is < 10 do this:
SELECT
  AVG(measurement)
FROM
  raw_data
WHERE
  date BETWEEN 1546300868 AND 1546302037;

-- but if >=10, do this:
SELECT
  AVG(measurement)
FROM
  ten_avg_data
WHERE
  date BETWEEN 1546300868 AND 1546302037

As you can see, SELECT AVG(measurement) is common for the two queries, I just want to make the from clause conditional.

As raw_data data is huge, it's not good to join raw_data and ten_avg_data together, because that'd be slow.

So please consider that querying the average from raw_data takes 1 minute, and getting the same from ten_avg_data is 0.5s.

In this case what is the best approach to make the FROM clause conditional?

Playground is here.

Upvotes: 1

Views: 103

Answers (2)

Gordon Linoff
Gordon Linoff

Reputation: 1269753

I think the best you can do is something like this:

WITH rd AS (
      SELECT COUNT(*) as cnt, AVG(measurement) as avg_m
      FROM raw_data
      WHERE date BETWEEN 1546300868 AND 1546302037;
    )
SELECT avg_m
FROM rd
WHERE cnt < 10
UNION ALL
SELECT AVG(measurement)
FROM ten_avg_data
WHERE date BETWEEN 1546300868 AND 1546302037 AND
      (SELECT cnt FROM rd) >= 10;

The cost of the COUNT() and AVG() is going to be very similar -- the expense is processing the data.

Over a minute for the first query seems way too long. You should consider an index on raw_data(date, measurement).

Upvotes: 0

sticky bit
sticky bit

Reputation: 37472

Well, one way to do it is to use a HAVING clause in the query over all_data that checks for the count(*) being greater than 10. If it isn't this query will yield the empty set. So you can use it as a subquery with NOT EXISTS in the WHERE clause of the second query making that second query yield the empty set, when the first one doesn't. Combine the results with UNION ALL and you'll get the result of the one not yielding the empty set.

SELECT avg(rd1.measurement)
       FROM raw_data rd1
       WHERE rd1.date BETWEEN 1546300868
                              AND 1546302037
       GROUP BY ''
       HAVING count(*) < 10
UNION ALL
SELECT avg(tad1.measurement)
       FROM ten_avg_data tad1
       WHERE tad1.date BETWEEN 1546300868
                               AND 1546302037
             AND NOT EXISTS (SELECT avg(rd1.measurement)
                                    FROM raw_data rd1
                                    WHERE rd1.date BETWEEN 1546300868
                                                           AND 1546302037
                                    GROUP BY ''
                                    HAVING count(*) < 10);

(Note that for some reason SQLite needs a GROUP BY clause in order to use a HAVING clause, we need to GROUP BY any odd constant as we actually one want one group (over the whole table).)

I'm not that sure however, if your approach really helps to boost performance. I don't know how good SQLite is in leveraging statistics to determine the count of rows will be greater than or equal to ten for the first query. Or if it at least will "stop" the first query when getting ten rows. If it cannot determine that right away, it needs to scan the relevant part of the index idx_date_raw_data anyway and you didn't gain anything (or at least not much) as it needed to do exactly that with only one query getting the average from all_data...

Upvotes: 1

Related Questions