Reputation: 343
I'm very new to working with DBs/SQL and doing my best to get adjusted I'm using SQLAlchemy/Postgres to write data from a Python API that I write market data from every 10 minutes (inserted into DB). This scraper is intended to get stock minute "candles", so it should be inserting 10 rows into each stock every time. I started running it last night and noticed the numbers are a bit off, in that I did a SELECT count(*) FROM exchange WHERE market='x_market';
in pgcli and I got a count of 900, where it should be around 1000 (started running 1k minutes ago). Essentially, what I want to do (if possible) is to see if there are any gaps (data it didn't catch) between rows. Each row has a unix timestamp, and each timestamp "below it" should be 60000 ms (1 min) different. I know in Python I could just iterate through and check that, but I am interested in learning more about SQL (would be much nicer to just check on pgcli). Is it possible to check this (with SQL)? I am attaching a screenshot to show schema/what I mean.
Thanks very much in advance.
(For pic, timestamp (unix) is first column, and the remaining others are just the equity price data)
Upvotes: 1
Views: 384
Reputation: 7441
One way to get the gaps or "missing" timestamps would be to use generate_series
to create a series of timestamps separated by 60000ms, then join to your table. Like so:
-- Simulate some data with a gap
WITH exchange(tstamp, val) AS
(
VALUES
(1551118800000, 0.03626),
(1551118860000, 0.036243),
(1551118980000, 0.03627)
)
-- Return timestamp(s) that do not match to exchange table data
SELECT stamps.tstamp AS missing_timestamp
FROM exchange
RIGHT JOIN generate_series(1551118800000, 1551118980000, 60000) stamps(tstamp)
ON exchange.tstamp = stamps.tstamp
WHERE exchange.tstamp IS NULL
-- Or using a more explicit anti-JOIN (cleaner?)
SELECT stamps.tstamp AS missing_timestamp
FROM generate_series(1551118800000, 1551118980000, 60000) stamps(tstamp)
WHERE NOT EXISTS (SELECT 1 FROM exchange WHERE tstamp = stamps.tstamp)
The first argument in generate_series
is the minimum timestamp and the second is the maximum. You can alter for your use case. The final argument is the 'step' (60000ms).
Result (given the simulated data above):
missing_timestamp
------
1551118920000
Upvotes: 1