jd_h2003
jd_h2003

Reputation: 304

Bigquery: How to extract value by date across different tables?

I have 2 tables that need joined together, and currently when I query them, I'm returned with a lot of identical rows with one column having a different value. I'd suspect it's performing an inner join across all matches, so I tried left joining so values were attached to date, but that appears to be incorrect.

I am trying to get the last entered value for each day across the following columns. Symbol is not unique, but all others have different values across different time periods within.

So my questions are:

  1. How would I select the max value for each date across each column and return it in a single row?

  2. How would I select the most recent value for each date across each column and return it in a single row?

My current query looks like this.

SELECT 
DISTINCT(shortdata.date),
shortdata.symbol,
shortdata.fee_rate,
shortdata.rebate_rate,
shortdata.short_shares_available,
tss.total_volume as share_volume,

FROM `db1.table1` as shortdata
  LEFT JOIN `db1.table2` as tss
    ON shortdata.symbol = tss.ticker
      AND shortdata.date = tss.date

WHERE shortdata.symbol in ('AMC')
ORDER BY date desc
LIMIT 25

Which is returning a query with 6 values for each identifier.

    date    symbol  fee_rate    rebate_rate short_shares_available  share_volume
2023-01-20  AMC     100.3354    -96.0154    200000  30313546
2023-01-20  AMC     97.206      -92.886     65000   30313546
2023-01-20  AMC     100.3354    -96.0154    200000  489689
2023-01-20  AMC     97.206      -92.886     65000   31271340
2023-01-20  AMC     100.3354    -96.0154    200000  31271340
2023-01-20  AMC     97.206      -92.886     65000   489689
2023-01-19  AMC     122.3875    -118.0675   300000  29182367
2023-01-19  AMC     117.3614    -113.0414   300000  29734773
2023-01-19  AMC     113.7761    -109.4561   300000  801000
2023-01-19  AMC     113.7761    -109.4561   300000  29182367
2023-01-19  AMC     122.3875    -118.0675   300000  801000
2023-01-19  AMC     113.7761    -109.4561   300000  29734773
2023-01-19  AMC     122.3875    -118.0675   300000  29734773
2023-01-19  AMC     117.3614    -113.0414   300000  29182367
2023-01-19  AMC     117.3614    -113.0414   300000  801000
2023-01-18  AMC     106.2183    -101.8983   150000  2294874
2023-01-18  AMC     106.2183    -101.8983   150000  61230037
2023-01-18  AMC     106.2183    -101.8983   150000  62117798
2023-01-17  AMC     105.4728    -101.1528   100000  57591052
2023-01-17  AMC     105.4436    -101.1236   150000  759340
2023-01-17  AMC     120.211     -115.891    150000  759340
2023-01-17  AMC     105.4436    -101.1236   150000  57591052
2023-01-17  AMC     105.4728    -101.1528   100000  56101661
2023-01-17  AMC     105.4436    -101.1236   150000  56101661
2023-01-17  AMC     107.1329    -102.8129   200000  57591052

The expected end result would be something like:

date        symbol  fee_rate    rebate_rate short_shares_available  share_volume
2023-01-20  AMC     100.3354    -96.0154    200000  31271340
2023-01-19  AMC     122.3875    -118.0675   300000  29182367
2023-01-18  AMC     106.2183    -101.8983   150000  2294874
2023-01-17  AMC     107.1329    -102.8129   200000  57591052

Here is a sample query from each table:

db1

date        symbol  fee_rate    rebate_rate short_shares_available
2023-01-20  AMC     100.3354    -96.0154    200000
2023-01-20  AMC     97.206      -92.886     65000
2023-01-19  AMC     117.3614    -113.0414   300000
2023-01-19  AMC     113.7761    -109.4561   300000
2023-01-19  AMC     122.3875    -118.0675   300000
2023-01-18  AMC     106.2183    -101.8983   150000
2023-01-17  AMC     107.1329    -102.8129   200000
2023-01-17  AMC     105.4728    -101.1528   100000
2023-01-17  AMC     105.4436    -101.1236   150000
2023-01-17  AMC     120.211     -115.891    150000

db2 Note: tape time is only to show where one could get most recent value. Tape time on the 21st will be the last entry from the 20th, etc.

date        ticker  tape_time                       share_volume
2023-01-20  AMC     2023-01-21 00:59:54.000000 UTC  31271340
2023-01-20  AMC     2023-01-20 14:28:56.000000 UTC  489689
2023-01-20  AMC     2023-01-20 20:59:58.000000 UTC  30313546
2023-01-19  AMC     2023-01-19 14:29:56.000000 UTC  801000
2023-01-19  AMC     2023-01-19 20:59:58.000000 UTC  29182367
2023-01-19  AMC     2023-01-20 00:59:45.000000 UTC  29734773
2023-01-18  AMC     2023-01-19 00:58:06.000000 UTC  62117798
2023-01-18  AMC     2023-01-18 20:59:59.000000 UTC  61230037
2023-01-18  AMC     2023-01-18 14:29:40.000000 UTC  2294874
2023-01-17  AMC     2023-01-18 00:59:42.000000 UTC  57591052

I have found snippets on a theoretical how this could be completed, but I'm unable to connect the dots further.

Any help would be appreciated.

Upvotes: 0

Views: 166

Answers (2)

Prajna Rai T
Prajna Rai T

Reputation: 1820

You can consider the approach below for your requirement.

SELECT distinct d1.date, d1.symbol, d1.fee_rate, d1.rebate_rate, 
d1.short_shares_available, d2.share_volume
FROM `project.dataset.db1` d1 left join `project.dataset.db2` d2
on d1.symbol=d2.ticker and d1.date = d2.date

WHERE true
QUALIFY 1 = ROW_NUMBER() OVER (PARTITION BY d1.date ORDER BY d1.date desc)

Upvotes: 1

Belayer
Belayer

Reputation: 14861

The initial issue seems to be a misunderstanding of DISTINCT. It is not a function and no matter how parenthesized it always applies to the entire row. Duplicate rows are removed but if a single column is different then it is not a duplicate row and therefore not removed. This is the issue you have. But Postgres offers a solution: DISTINCT ON(...). When used it returns only the first row of set of rows containing the specified value(s). In your case perhaps:

SELECT DISTINCT on(shortdata.date),    --- will select just the first row for each date
       shortdata.date                  ---    (Not selected by above, repeat to return)
       shortdata.symbol,
       shortdata.fee_rate,
       shortdata.rebate_rate,
       shortdata.short_shares_available,
       tss.total_volume as share_volume,
FROM db1.table1 as shortdata
LEFT JOIN db1.table2 as tss
  ON shortdata.symbol = tss.ticker
 AND shortdata.date = tss.date
WHERE shortdata.symbol in ('AMC')
ORDER BY shortdata.date desc                --- based upon this criteria
LIMIT 25;
    

Upvotes: 0

Related Questions