Finding the initial sampled time window after using SAMPLE BY again

Question

I can't seem to find a perhaps easy solution to what I'm trying to accomplish here, using SQL and, more importantly, QuestDB. I also find it hard to put my exact question into words so bear with me.

Input

My real input is different of course but a similar dataset or case is the gas_prices table on the demo page of QuestDB. On https://demo.questdb.io, you can directly write and run queries against some sample database, so it should be easy enough to follow.

The main task I want to accomplish is to find out which month was responsible for the year's highest galon price.

Output

Using the following query, I can get the average galon price per month just fine.

SELECT timestamp, avg(galon_price) as avg_per_month FROM 'gas_prices' SAMPLE BY 1M

timestamp	avg_per_month
2000-06-05T00:00:00.000000Z	1.6724
2000-07-05T00:00:00.000000Z	1.69275
2000-08-05T00:00:00.000000Z	1.635
...	...

Then, I get all these monthly averages, group them by year and return the maximum galon price per year by wrapping the above query in a subquery, like so:

SELECT timestamp, max(avg_per_month) as max_per_year FROM (
  SELECT timestamp, avg(galon_price) as avg_per_month FROM 'gas_prices' SAMPLE BY 1M
) SAMPLE BY 12M

timestamp	max_per_year
2000-01-05T00:00:00.000000Z	1.69275
2001-01-05T00:00:00.000000Z	1.767399999999
2002-01-05T00:00:00.000000Z	1.52075
...	...

Wanted output

I want to know which month was responsible for the maximum price of a year.

Looking at the output of the above query, we see that the maximum galon price for the year 2000 was 1.69275. Which month of the year 2000 had this amount as average price? I'd like to display this month in an additional column.

For the first row, July 2000 is shown in the additional column for year 2000 because it is responsible for the highest average price in 2000. For the second row, it was May 2001 as that month had the highest average price of 2001.

timestamp	max_per_year	which_month_is_responsible
2000-01-05T00:00:00.000000Z	1.69275	2000-07-05T00:00:00.000000Z
2001-01-05T00:00:00.000000Z	1.767399999999	2001-05-05T00:00:00.000000Z
...	...

What did I try?

I tried by adding a subquery to the SELECT to have a "duplicate" of some sort for the timestamp column but that's apparently never valid in QuestDB (?), so probably the solution is by adding even more subqueries in the FROM? Or a UNION?

Who can help me out with this? The data is there in the database and it can be calculated. It's just a matter of getting it out.

Bolek Ziobrowski · Accepted Answer

I think 'wanted output' can be achieved with window functions. Please have a look at:

CREATE TABLE electricity (ts TIMESTAMP, consumption DOUBLE) TIMESTAMP(ts);

INSERT INTO electricity 
SELECT (x*1000000)::timestamp, rnd_double() 
FROM long_sequence(10000000);

SELECT day, ts, max_per_day 
FROM 
(
  SELECT timestamp_floor('d', ts) as day, 
         ts, 
         avg_in_15_min as max_per_day, 
         row_number() OVER (PARTITION BY timestamp_floor('d', ts) ORDER BY avg_in_15_min desc) as rn_per_day 
  FROM 
  (
    SELECT ts, avg(consumption) as avg_in_15_min 
    FROM electricity 
    SAMPLE BY 15m
  ) 
) WHERE rn_per_day = 1

Finding the initial sampled time window after using SAMPLE BY again

Answers (1)

Related Questions