Reputation: 304
I have 2 tables that need joined together, and currently when I query them, I'm returned with a lot of identical rows with one column having a different value. I'd suspect it's performing an inner join across all matches, so I tried left joining so values were attached to date, but that appears to be incorrect.
I am trying to get the last entered value for each day across the following columns. Symbol is not unique, but all others have different values across different time periods within.
So my questions are:
How would I select the max value for each date across each column and return it in a single row?
How would I select the most recent value for each date across each column and return it in a single row?
My current query looks like this.
SELECT
DISTINCT(shortdata.date),
shortdata.symbol,
shortdata.fee_rate,
shortdata.rebate_rate,
shortdata.short_shares_available,
tss.total_volume as share_volume,
FROM `db1.table1` as shortdata
LEFT JOIN `db1.table2` as tss
ON shortdata.symbol = tss.ticker
AND shortdata.date = tss.date
WHERE shortdata.symbol in ('AMC')
ORDER BY date desc
LIMIT 25
Which is returning a query with 6 values for each identifier.
date symbol fee_rate rebate_rate short_shares_available share_volume
2023-01-20 AMC 100.3354 -96.0154 200000 30313546
2023-01-20 AMC 97.206 -92.886 65000 30313546
2023-01-20 AMC 100.3354 -96.0154 200000 489689
2023-01-20 AMC 97.206 -92.886 65000 31271340
2023-01-20 AMC 100.3354 -96.0154 200000 31271340
2023-01-20 AMC 97.206 -92.886 65000 489689
2023-01-19 AMC 122.3875 -118.0675 300000 29182367
2023-01-19 AMC 117.3614 -113.0414 300000 29734773
2023-01-19 AMC 113.7761 -109.4561 300000 801000
2023-01-19 AMC 113.7761 -109.4561 300000 29182367
2023-01-19 AMC 122.3875 -118.0675 300000 801000
2023-01-19 AMC 113.7761 -109.4561 300000 29734773
2023-01-19 AMC 122.3875 -118.0675 300000 29734773
2023-01-19 AMC 117.3614 -113.0414 300000 29182367
2023-01-19 AMC 117.3614 -113.0414 300000 801000
2023-01-18 AMC 106.2183 -101.8983 150000 2294874
2023-01-18 AMC 106.2183 -101.8983 150000 61230037
2023-01-18 AMC 106.2183 -101.8983 150000 62117798
2023-01-17 AMC 105.4728 -101.1528 100000 57591052
2023-01-17 AMC 105.4436 -101.1236 150000 759340
2023-01-17 AMC 120.211 -115.891 150000 759340
2023-01-17 AMC 105.4436 -101.1236 150000 57591052
2023-01-17 AMC 105.4728 -101.1528 100000 56101661
2023-01-17 AMC 105.4436 -101.1236 150000 56101661
2023-01-17 AMC 107.1329 -102.8129 200000 57591052
The expected end result would be something like:
date symbol fee_rate rebate_rate short_shares_available share_volume
2023-01-20 AMC 100.3354 -96.0154 200000 31271340
2023-01-19 AMC 122.3875 -118.0675 300000 29182367
2023-01-18 AMC 106.2183 -101.8983 150000 2294874
2023-01-17 AMC 107.1329 -102.8129 200000 57591052
Here is a sample query from each table:
db1
date symbol fee_rate rebate_rate short_shares_available
2023-01-20 AMC 100.3354 -96.0154 200000
2023-01-20 AMC 97.206 -92.886 65000
2023-01-19 AMC 117.3614 -113.0414 300000
2023-01-19 AMC 113.7761 -109.4561 300000
2023-01-19 AMC 122.3875 -118.0675 300000
2023-01-18 AMC 106.2183 -101.8983 150000
2023-01-17 AMC 107.1329 -102.8129 200000
2023-01-17 AMC 105.4728 -101.1528 100000
2023-01-17 AMC 105.4436 -101.1236 150000
2023-01-17 AMC 120.211 -115.891 150000
db2 Note: tape time is only to show where one could get most recent value. Tape time on the 21st will be the last entry from the 20th, etc.
date ticker tape_time share_volume
2023-01-20 AMC 2023-01-21 00:59:54.000000 UTC 31271340
2023-01-20 AMC 2023-01-20 14:28:56.000000 UTC 489689
2023-01-20 AMC 2023-01-20 20:59:58.000000 UTC 30313546
2023-01-19 AMC 2023-01-19 14:29:56.000000 UTC 801000
2023-01-19 AMC 2023-01-19 20:59:58.000000 UTC 29182367
2023-01-19 AMC 2023-01-20 00:59:45.000000 UTC 29734773
2023-01-18 AMC 2023-01-19 00:58:06.000000 UTC 62117798
2023-01-18 AMC 2023-01-18 20:59:59.000000 UTC 61230037
2023-01-18 AMC 2023-01-18 14:29:40.000000 UTC 2294874
2023-01-17 AMC 2023-01-18 00:59:42.000000 UTC 57591052
I have found snippets on a theoretical how this could be completed, but I'm unable to connect the dots further.
Any help would be appreciated.
Upvotes: 0
Views: 166
Reputation: 1820
You can consider the approach below for your requirement.
SELECT distinct d1.date, d1.symbol, d1.fee_rate, d1.rebate_rate,
d1.short_shares_available, d2.share_volume
FROM `project.dataset.db1` d1 left join `project.dataset.db2` d2
on d1.symbol=d2.ticker and d1.date = d2.date
WHERE true
QUALIFY 1 = ROW_NUMBER() OVER (PARTITION BY d1.date ORDER BY d1.date desc)
Upvotes: 1
Reputation: 14861
The initial issue seems to be a misunderstanding of DISTINCT. It is not a function and no matter how parenthesized it always applies to the entire row. Duplicate rows are removed but if a single column is different then it is not a duplicate row and therefore not removed. This is the issue you have. But Postgres offers a solution: DISTINCT ON(...). When used it returns only the first row of set of rows containing the specified value(s). In your case perhaps:
SELECT DISTINCT on(shortdata.date), --- will select just the first row for each date
shortdata.date --- (Not selected by above, repeat to return)
shortdata.symbol,
shortdata.fee_rate,
shortdata.rebate_rate,
shortdata.short_shares_available,
tss.total_volume as share_volume,
FROM db1.table1 as shortdata
LEFT JOIN db1.table2 as tss
ON shortdata.symbol = tss.ticker
AND shortdata.date = tss.date
WHERE shortdata.symbol in ('AMC')
ORDER BY shortdata.date desc --- based upon this criteria
LIMIT 25;
Upvotes: 0