Louise
Louise

Reputation: 351

Self joining next date in table as another date field

Trying to work with an exchange rate table and a transactions table, and how the transactions are joined to the exchange table depends on when the most recent exchange rate for that currency was relative to the transaction. The table contains many duplicates and many types of currencies and other fields not relevant to this issue.

I plan on joining the transactions by using BETWEEN for the start date and end date for each currency rate and the date of the transaction, however I don't have an end date field. The exchange rate for a currency ends when the next one starts.

For example:

Currency Start Date Rate
EUR 01/12/2021 1.25
US 01/12/2021 0.75
EUR 25/12/2021 1.11
US 10/12/2021 0.8
EUR 01/12/2021 1.25
US 25/12/2021 1.11

Should become:

Currency Start Date Rate End Date
EUR 01/12/2021 1.25 24/12/2021
US 01/12/2021 0.75 09/12/2021
EUR 25/12/2021 1.11 today
US 10/12/2021 0.8 today

I was thinking of:

with ordered_currency as (
select distinct Currency, Start_date
from exchange_rate_table
order by currency, start_date asc
)

This would produce a table of all currencies in order of date, removing any duplicates.

The next step would be to check if the next line is the same currency, if it is then take it's start date, as the end date of the current line. If it's not the same currency then simply put current_date() as the end_date, and could then join this End_date field back onto the original table.

However, I'm not sure of the syntax which would be used to evaluate whether the line below has the same field (in this case currency)

Any help would be appreciated!

Upvotes: 1

Views: 57

Answers (2)

Simeon Pilgrim
Simeon Pilgrim

Reputation: 25903

So another method, that doesn't use JOINS is:

with a CTE for a input data:

WITH fake_data(currency, start_date, rate) AS (
    SELECT column1, to_date(column2, 'dd/mm/yyyy'), column3 
    FROM VALUES 
        ('EUR','01/12/2021', 1.25),
        ('US','01/12/2021', 0.75),
        ('EUR','25/12/2021', 1.11),
        ('US','10/12/2021', 0.8),
        ('EUR','01/12/2021', 1.25),
        ('US','25/12/2021', 1.11)
)

then to get the end_time for each row (same as Caius's answer):

SELECT currency
    ,start_date
    ,rate
    ,lead(start_date,1,current_date) over (order by start_date) as end_date
FROM fake_data
ORDER BY 2;

gives:

CURRENCY START_DATE RATE END_DATE
EUR 2021-12-01 1.25 2021-12-01
US 2021-12-01 0.75 2021-12-01
EUR 2021-12-01 1.25 2021-12-10
US 2021-12-10 0.8 2021-12-25
EUR 2021-12-25 1.11 2021-12-25
US 2021-12-25 1.11 2022-02-02

But to now chain the results together we want to filer out streaks of same, which we can do with a LAG and QUALIFY

SELECT currency
    ,start_date
    ,rate
    ,lag(currency) over (order by start_date) as lag_currency
FROM fake_data
QUALIFY lag_currency IS NULL OR lag_currency != currency
ORDER BY 2;

and given the prior "end on the row is not useful in that context, we can drop it, but now we want the next entry time to be the end_time of the streak

SELECT currency
    ,start_date
    ,rate
    ,lead(start_date,1,current_date) over (order by start_date) as end_date
FROM (
    SELECT currency
        ,start_date
        ,rate
        ,lag(currency) over (order by start_date) as lag_currency
    FROM fake_data
)
WHERE lag_currency IS NULL OR lag_currency != currency
ORDER BY 2;

I swapped from QUALIFY to an outer WHERE clause, because it's just seems common usage when using nested SELECTs

CURRENCY START_DATE RATE END_DATE
EUR 2021-12-01 1.25 2021-12-01
US 2021-12-01 0.75 2021-12-01
EUR 2021-12-01 1.25 2021-12-10
US 2021-12-10 0.8 2021-12-25
EUR 2021-12-25 1.11 2021-12-25
US 2021-12-25 1.11 2022-02-02

Now there are some things the JOIN method allows, like aggregation of count of streak rows and values on those, but we can do that here also, we just do a couple of extra steps.

WITH fake_data(currency, start_date, rate) AS (
    SELECT column1, to_date(column2, 'dd/mm/yyyy'), column3 
    FROM VALUES 
        ('EUR','01/12/2021', 1.25),
        ('EUR','02/12/2021', 1.25),
        ('EUR','03/12/2021', 1.25),
        ('US','04/12/2021', 0.75),
        ('US','10/12/2021', 0.8),
        ('EUR','23/12/2021', 1.11),
        ('EUR','24/12/2021', 1.11),
        ('US','25/12/2021', 1.11)
)
SELECT a.*
    ,nvl(streak_start, lag(streak_start)ignore nulls over(order by start_date)) as streak_date
FROM (
    SELECT currency
        ,start_date
        ,rate
        ,lag(currency) over (order by start_date) as lag_currency
        ,lag_currency IS NULL OR lag_currency != currency as new_row
        ,iff(new_row, start_date, null) as streak_start
    FROM fake_data
) AS a
ORDER BY 2;

this gives a grouping token streak_date

CURRENCY START_DATE RATE LAG_CURRENCY NEW_ROW STREAK_START STREAK_DATE
EUR 2021-12-01 1.25 TRUE 2021-12-01 2021-12-01
EUR 2021-12-02 1.25 EUR FALSE 2021-12-01
EUR 2021-12-03 1.25 EUR FALSE 2021-12-01
US 2021-12-04 0.75 EUR TRUE 2021-12-04 2021-12-04
US 2021-12-10 0.8 US FALSE 2021-12-04
EUR 2021-12-23 1.11 US TRUE 2021-12-23 2021-12-23
EUR 2021-12-24 1.11 EUR FALSE 2021-12-23
US 2021-12-25 1.11 EUR TRUE 2021-12-25 2021-12-25

thus you can then wrap that and do any aggregation math you like:

WITH fake_data(currency, start_date, rate) AS (
    SELECT column1, to_date(column2, 'dd/mm/yyyy'), column3 
    FROM VALUES 
        ('EUR','01/12/2021', 1.25),
        ('EUR','02/12/2021', 1.26),
        ('EUR','03/12/2021', 1.27),
        ('US','04/12/2021', 0.75),
        ('US','10/12/2021', 0.8),
        ('EUR','23/12/2021', 1.11),
        ('EUR','24/12/2021', 1.14),
        ('US','25/12/2021', 1.11)
)
SELECT b.currency
    ,min(b.start_date) as start_date
    ,any_value(streak_end) as end_date
    ,count(*) as streak_count
    ,avg(rate) as avg_rate
    ,min(rate) as min_rate
    ,max(rate) as max_rate
FROM (
    SELECT a.*
        ,nvl(streak_start, lag(streak_start)ignore nulls over(order by start_date)) as streak_date
        ,lead(streak_start,1,current_date) ignore nulls over (order by start_date) as streak_end
    FROM (
        SELECT currency
            ,start_date
            ,rate
            ,lag(currency) over (order by start_date) as lag_currency
            ,lag_currency IS NULL OR lag_currency != currency as new_row
            ,iff(new_row, start_date, null) as streak_start
        FROM fake_data
    ) AS a
) as b
GROUP BY b.streak_date,1
ORDER BY 2;
CURRENCY START_DATE END_DATE STREAK_COUNT AVG_RATE MIN_RATE
EUR 2021-12-01 2021-12-04 3 1.25 1.25
US 2021-12-04 2021-12-23 2 0.775 0.75
EUR 2021-12-23 2021-12-25 2 1.11 1.11
US 2021-12-25 2022-02-02 1 1.11 1.11

Upvotes: 1

Caius Jard
Caius Jard

Reputation: 74605

however I don't have an end date field.

It's a lot easier if you do. You can "get the one from the next row" with LEAD:

WITH exchg_from_to AS(
    SELECT
      Currency,
      StartDate,
      LEAD(StartDate, 1, TO_DATE('9999-12-31')) OVER(PARTITION BY Currency ORDER BY StartDate), CAST('9999-12-31' as date) EndDate,
      Rate
    FROM
      exchange_rate_table
)


SELECT * 
FROM 
  tran t 
  JOIN exchg_from_to e 
  ON 
    t.TranDate >= e.StartDate AND 
    t.TranDate < e.EndDate AND
    t.Currency = e.Currency

You might have to tweak that TO_DATE a bit.. (I've never used SnowFlake, but the docs say LEAD is supported))

I don't recmmend using BETWEEN because it's inclusive both ends; if you have a tran that is bang on the date it'll match twice, once with the start date of the N row and once with the N-1 end date. Using a >= and < pair ensures no overlap

Upvotes: 2

Related Questions