Chris
Chris

Reputation: 2071

SQL - LEFT JOIN with the most recent data given id and date

I'm working in SQL Server Management Studio, and I'm trying to find a way to extract (with LEFT JOIN) the most recent data available from table2 given a month and an id in table1. table1 is the purchases given a date and a Client_ID:

enter image description here

You can see in table1 that we have 3 clients, each purchased a certain product in the year and month specified in YEAR_MONTH. Now, I need some data from those clients, and for this I need to go to table2:

enter image description here

To extract the DATA columns (assume there are 100 columns like DATA). I'm looking for a way to extract the most recent data for a given Client_ID in a YEAR_MONTH of table1. In addition, this have a big constraint (here comes the hard part):

The lag/delay: The difference between YEAR_MONTH of table1 and table2 has to be minimum 1 month and maximum 7 months. Some examples:

I have gone this so far:

    SELECT *
FROM (
    SELECT ROW_NUMBER() OVER (PARTITION BY Client_ID ORDER BY YEAR_MONTH DESC) AS Recency_ID, table2.*
    FROM table2) t
WHERE Recency_ID=1

Where Recency_ID=1 is the most recent data for that Client_ID, Recency_ID=2 is the second most recent data available for that Client_ID, etc. However, this solution doesn't respect the delay, and it will give the most recent data available from table2 no matter the YEAR_MONTH of table1 (or the lag/delay constrain). This is the output expected:

enter image description here

Upvotes: 1

Views: 833

Answers (2)

seanb
seanb

Reputation: 6685

As you say, the more tricky component is to work out the date ranges.

In the answer below, I started with each of the rows from Table1, then calculated the minimum and maximum Year_Months allowed - these provide the 'bounds' when joining to Table2.

Note that the horrible-looking statement

CAST(FORMAT(DATEADD(month, -7, LTRIM(STR(Year_Month)) + '01'), 'yyyyMM') AS int) AS Min_YM
  • First gets the Year_Month from Table1 (with original value as an int)
  • Converts it to a date using the first of the month
  • Subtracts 7 months
  • Then converts it back to an int in the format used for Year_Month

The reason to convert to dates is for the 12-month cycle per year. Otherwise, if you were using straight ints and started with '202001' and you subtracted 7, you would result in '201994' rather than '201906'.

Ideally, you'd be working with dates not Year_Months, but given the current dataset, the above works pretty well (as in, the calculations are easy and then the joins to Table2 need no other conversions).

Then a similar query to Gordon's above pulls the more recent values from Table2 that within the 1 month/7 month bounds.

By the way, I worked out the answer to my question in comments - the 1 and 7 months are not inclusive - you actually want dates that occur before or after those.

WITH Table1_DateRanges AS
    (SELECT Client_ID, Year_Month, Product,
            CAST(FORMAT(DATEADD(month, -7, LTRIM(STR(Year_Month)) + '01'), 'yyyyMM') AS int) AS Min_YM,
            CAST(FORMAT(DATEADD(month, -1, LTRIM(STR(Year_Month)) + '01'), 'yyyyMM') AS int) AS Max_YM
    FROM    Table1
    )
SELECT  t1.Client_ID, t1.Year_Month, t1.Product, t2.Year_Month AS Year_Month_Table2, t2.Data
FROM    Table1_DateRanges t1 
        OUTER APPLY
             (SELECT TOP 1 t2.Year_Month, t2.Data
              FROM   Table2 t2
              WHERE  t1.Client_Id = t2.Client_Id 
                     AND t1.Min_YM < t2.Year_Month
                     AND t1.Max_YM > t2.Year_Month
              ORDER by t2.year_month desc
             ) t2
ORDER BY t1.Client_ID, t1.Year_Month DESC;

An alternate version (not using OUTER APPLY but using a LEFT JOIN and ROW_NUMBER() is as follows.

WITH Table1_DateRanges AS
    (SELECT Client_ID, Year_Month, Product,
            CAST(FORMAT(DATEADD(month, -7, LTRIM(STR(Year_Month)) + '01'), 'yyyyMM') AS int) AS Min_YM,
            CAST(FORMAT(DATEADD(month, -1, LTRIM(STR(Year_Month)) + '01'), 'yyyyMM') AS int) AS Max_YM
    FROM    Table1
    ),
PossibleResults AS
    (SELECT  t1.Client_ID, t1.Year_Month, t1.Product, t2.Year_Month AS Year_Month_Table2, t2.Data,
            ROW_NUMBER() OVER (PARTITION BY t1.Client_ID, t1.Year_Month ORDER BY t2.Year_Month DESC) AS rn
    FROM    Table1_DateRanges t1 
            LEFT OUTER JOIN Table2 t2
               ON t1.Client_Id = t2.Client_Id 
               AND t1.Min_YM < t2.Year_Month
               AND t1.Max_YM > t2.Year_Month
     )
SELECT   Client_Id, Year_Month, Product, Year_Month_Table2, Data
FROM PossibleResults
WHERE rn = 1
ORDER BY Client_ID, Year_Month DESC;

Both of these result in the following

Client_ID  Year_Month  Product      Year_Month_Table2  Data
1          202001      anyproduct1  201911             whatever3
2          202001      anyproduct2  201907             whatever7
2          201905      anyproduct3  201903             whatever8
3          202004      anyproduct4  202001             whatever12
3          201905      anyproduct5  NULL               NULL
3          201707      anyproduct6  201705             whatever16

Here is a db<>fiddle with data setup and both approaches.

Upvotes: 1

Gordon Linoff
Gordon Linoff

Reputation: 1269693

You can use apply for this. I think you want:

select t1.*, t2.*
from table1 t1 outer apply
     (select top (1) t2.*
      from table2 t2
      where t2.client_id = t1.client_id and
            t2.year_month <= t1.year_month
      order by t2.year_month desc
     ) t2;

This fetches the most recent table2 row for each table1 row, where the clients match.

Upvotes: 1

Related Questions