Javier Ramirez
Javier Ramirez

Reputation: 4032

Find the first occurrence of each unique value in a QuestDB table

I have a QuestDB table storing market data, and I need to find the first recorded trade for each stock symbol in the past month.

This is my schema

CREATE TABLE 'trades' ( 
    symbol SYMBOL CAPACITY 256 CACHE,
    side SYMBOL CAPACITY 256 CACHE,
    price DOUBLE,
    amount DOUBLE,
    timestamp TIMESTAMP
) timestamp(timestamp) PARTITION BY DAY WAL;

Sample Data

symbol side price amount timestamp
BTC-USDT buy 84717.7 0.00177074 2025-02-28T00:00:00.114000Z
BTC-USDT buy 84718.2 0.00235000 2025-02-28T00:00:05.210000Z
ETH-USDT buy 2307.77 0.064997 2025-02-28T00:00:00.520999Z
ETH-USDT buy 2308.50 0.032100 2025-02-28T00:00:10.620000Z
DOGE-USDT buy 0.20691 99.934973 2025-02-28T00:00:00.542000Z
DOGE-USDT buy 0.20694 50.000000 2025-02-28T00:00:02.847000Z
SOL-USDT buy 137.70 0.014524 2025-02-28T00:00:00.604000Z
SOL-USDT buy 138.20 0.500000 2025-02-28T00:00:07.340000Z

I can get the first trade per symbol with min MIN(timestamp), like this:

SELECT symbol, min(timestamp) FROM trades 
WHERE timestamp IN today(); 

But this returns just the first timestamp per symbol, and I need the whole row. If I wanted the last row, I could do:

SELECT * from trades latest on timestamp partition by symbol;

But I want the reverse of tat. I tried adding LATEST ON timestamp DESC but then it complains.

Is there an equivalent to LATEST ON for EARLIEST ON? Otherwise, How can I modify this query to get the first row of the day for each different symbol?

Upvotes: 0

Views: 10

Answers (1)

Javier Ramirez
Javier Ramirez

Reputation: 4032

This can be solved in at least two different ways.

The first one would be to group by symbol and just get the FIRST value for every column, as in

select symbol, first(timestamp) as timestamp, first(side) as side, first(price) as price, first(amount) as amount from trades
where timestamp in today();

Note in QuestDB data is implicitly sorted by increasing timestamp, so no need to do an ORDER BY before doing the GROUP BY.

This solution should work fine and be efficient. The only downside to me is that you need to write a lot of FIRST..AS to aggregate and to then get the columns with the right name.

An alternative I sometimes use when I have many columns is to use a window function. It is slower, so I tend to use it only when querying a small part of my data. Since our query is scoped to today() we should be OK with this alternative.

WITH ranked_trades AS (
    SELECT 
        *,
        row_number() OVER (PARTITION BY symbol ORDER BY timestamp) AS rn
    FROM trades
    WHERE timestamp IN today()
)
SELECT *
FROM ranked_trades
WHERE rn = 1;

This query would first find the position number (as rn) for each row, relative to rows with the same symbol, and then on a second pass it would filter only the ones with rn 1. It performs slower than the GROUP BY as it needs to go over all the rows to calculate the row number and then discard all subsequent rows on the second pass, but on the plus side you can just do select * and write the query way faster.

I tend to use this second approach when I am doing interactive things on the web console, and the first one when I am automating a dashboard or anything that will run the query frequently.

Upvotes: 0

Related Questions