Secret
Secret

Reputation: 3358

Querying for latest and time-intervaled data

My data set is quite similar to stock prices - I get an average price, symbol, company name, every 5 minutes.

What I need to do is be able to efficiently:

Currently, this is what I have so far:

Stock Table
ID | name | symbol
Unique between name and symbol (name/symbol can be unique on their own)
Indexed on name, symbol (irrelevant here, but indexed for text search)

Stock Ticks Table
ID | stock_id | price | updated_at
All columns non null

For question 1 (Get latest price for a given symbol), I'm already having a bit of trouble - this is essentially similar to other questions:

Get latest comments for each post or essentially a greatest-n-per-group query. Thing is my data is going to get really large (every 5 minutes) so I think this is a good case for preoptimization. Should I add a current_price column (or current_prices table)? Or is a group by/distinct? How would that query be efficiently written?

For question 2 (Get the time intervaled prices) I'm honestly a bit troubled how to write a query for that. Note that the tricky part is there might be holes in the data, so for example, if the query is:

get every price per day from June 1 to June 10

and say there's no data from June 3, then it should try and find the closest time from that (either past or present)

I am writing this on Phoenix/Ecto, so if you could write it with the ORM that would be a plus but not required.

Upvotes: 0

Views: 62

Answers (1)

flutter
flutter

Reputation: 754

Assuming PostgreSQL v9.6 (you didn't specify).

The stock table definition

CREATE TABLE stock ( id serial NOT NULL PRIMARY KEY, 
                     name text NOT NULL UNIQUE, 
                     symbol text NOT NULL UNIQUE );

and the price table definition

CREATE TABLE pricing ( id int NOT NULL REFERENCES stock (id), 
                       updated_at TIMESTAMP(0) NOT NULL, 
                       price NUMERIC( 10, 2 ) NOT NULL, 
                       PRIMARY KEY (id, updated_at) );

And to accelerate price lookups by date without stock_ids

CREATE INDEX ON pricing (updated_at);

Example stock values are (1, 'Queen', 'BEE'), (2, 'Team Fox', 'FOX'). Example pricing values are

 (1, '2017-06-17 13:24:59', 12.34), 
 (1, '2017-06-01 18:00:00', 6.10), 
 (1, '2017-06-02 17:00:00', 6.20), 
 (1, '2017-06-03 17:00:00', 6.30), 
 (2, '2017-06-02 15:00:00', 100.00), 
 (2, '2017-06-03 15:30:00', 777.00);

To get the latest prices for all symbols

SELECT s.*, 
       (SELECT price 
        FROM pricing 
        WHERE id = s.id 
        ORDER BY updated_at DESC 
        LIMIT 1) "latest_price" 
FROM stock s 
WHERE EXISTS (SELECT id FROM pricing p WHERE p.id = s.id);

which will exclude NULL values for the latest_price using the WHERE EXISTS condition. Leave it out to get nulls for stocks for which no price is known yet.

An alternative to

get the latest prices

is

SELECT * 
FROM stock s 
JOIN LATERAL 
(SELECT price 
 FROM pricing p 
 WHERE p.id = s.id 
 ORDER BY updated_at DESC 
 LIMIT 1) latest_price ON true ;

And to

get every price per day from June 1 to June 10

generate the appropriate dates first with

WITH RECURSIVE dates (d) AS ( 
SELECT '2017-06-01 20:00'::timestamp 
UNION ALL 
SELECT d + interval '24 hours' 
FROM dates 
WHERE d < '2017-06-04 20:00'::timestamp 
) 
SELECT d FROM dates ;

which produces

          d          
---------------------
 2017-06-01 20:00:00
 2017-06-02 20:00:00
 2017-06-03 20:00:00
 2017-06-04 20:00:00

You can adapt the following parameters

  • the start date ('2017-06-01 20:00'::timestamp)
  • the end date ('2017-06-04 20:00'::timestamp)
  • the time steps (interval '24 hours')

With the dates use

WITH RECURSIVE dates (d) AS ( 
  SELECT '2017-06-01 20:00'::timestamp 
  UNION ALL 
  SELECT d + interval '24 hours' 
  FROM dates 
  WHERE d < '2017-06-04 20:00'::timestamp 
) 
SELECT symbol, CAST( d AS date ) "day", price, updated_at  
FROM stock s 
CROSS JOIN dates  
JOIN LATERAL ( 
  SELECT updated_at, price  
  FROM pricing p 
  WHERE p.id = s.id AND p.updated_at <= d 
  ORDER BY p.updated_at DESC 
  LIMIT 1 
) latest_prices ON true 
ORDER BY 1, 2, 4, 3 ;
-- orders same id = name blocks; 
-- use 2, 1, 4, 3 to get same date blocks

to get for example

 symbol | day            | price  |     updated_at      
--------+----------------+--------+---------------------
 FOX    | 2017-06-02     | 100.00 | 2017-06-02 15:00:00 
 FOX    | 2017-06-03     | 777.00 | 2017-06-03 15:30:00
 FOX    | 2017-06-04     | 777.00 | 2017-06-03 15:30:00 -- price from 3rd
 BEE    | 2017-06-01     |   6.10 | 2017-06-01 18:00:00
 BEE    | 2017-06-02     |   6.20 | 2017-06-02 17:00:00
 BEE    | 2017-06-03     |   6.30 | 2017-06-03 17:00:00
 BEE    | 2017-06-04     |   6.30 | 2017-06-03 17:00:00 -- price from 3rd

Upvotes: 1

Related Questions