Reputation: 2071
I'm working in SQL Server Management Studio, and I'm trying to find a way to extract (with LEFT JOIN
) the most recent data available from table2
given a month and an id in table1
. table1
is the purchases given a date and a Client_ID
:
You can see in table1
that we have 3 clients, each purchased a certain product in the year and month specified in YEAR_MONTH
. Now, I need some data from those clients, and for this I need to go to table2
:
To extract the DATA
columns (assume there are 100 columns like DATA
). I'm looking for a way to extract the most recent data for a given Client_ID
in a YEAR_MONTH
of table1
. In addition, this have a big constraint (here comes the hard part):
The lag/delay: The difference between YEAR_MONTH
of table1
and table2
has to be minimum 1 month and maximum 7 months. Some examples:
Client_ID=1
and YEAR_MONTH=2020001
in table1
, we need to extract the most recent data from 201906 and 201911 both included. Which in this case is 202011 so we will have "whatever3" data in table 3
.Client_ID=2
and YEAR_MONTH=202001
in table1
, we need to extract the most recent data from 201906 and 201911 both included. Which in this case we have to options: 201912 and 201907. The most recent data is 201912, so we will have "whatever6" data in table3
.I have gone this so far:
SELECT *
FROM (
SELECT ROW_NUMBER() OVER (PARTITION BY Client_ID ORDER BY YEAR_MONTH DESC) AS Recency_ID, table2.*
FROM table2) t
WHERE Recency_ID=1
Where Recency_ID=1
is the most recent data for that Client_ID
, Recency_ID=2
is the second most recent data available for that Client_ID
, etc. However, this solution doesn't respect the delay, and it will give the most recent data available from table2
no matter the YEAR_MONTH
of table1
(or the lag/delay constrain). This is the output expected:
Upvotes: 1
Views: 833
Reputation: 6685
As you say, the more tricky component is to work out the date ranges.
In the answer below, I started with each of the rows from Table1, then calculated the minimum and maximum Year_Months allowed - these provide the 'bounds' when joining to Table2.
Note that the horrible-looking statement
CAST(FORMAT(DATEADD(month, -7, LTRIM(STR(Year_Month)) + '01'), 'yyyyMM') AS int) AS Min_YM
The reason to convert to dates is for the 12-month cycle per year. Otherwise, if you were using straight ints and started with '202001' and you subtracted 7, you would result in '201994' rather than '201906'.
Ideally, you'd be working with dates not Year_Months, but given the current dataset, the above works pretty well (as in, the calculations are easy and then the joins to Table2 need no other conversions).
Then a similar query to Gordon's above pulls the more recent values from Table2 that within the 1 month/7 month bounds.
By the way, I worked out the answer to my question in comments - the 1 and 7 months are not inclusive - you actually want dates that occur before or after those.
WITH Table1_DateRanges AS
(SELECT Client_ID, Year_Month, Product,
CAST(FORMAT(DATEADD(month, -7, LTRIM(STR(Year_Month)) + '01'), 'yyyyMM') AS int) AS Min_YM,
CAST(FORMAT(DATEADD(month, -1, LTRIM(STR(Year_Month)) + '01'), 'yyyyMM') AS int) AS Max_YM
FROM Table1
)
SELECT t1.Client_ID, t1.Year_Month, t1.Product, t2.Year_Month AS Year_Month_Table2, t2.Data
FROM Table1_DateRanges t1
OUTER APPLY
(SELECT TOP 1 t2.Year_Month, t2.Data
FROM Table2 t2
WHERE t1.Client_Id = t2.Client_Id
AND t1.Min_YM < t2.Year_Month
AND t1.Max_YM > t2.Year_Month
ORDER by t2.year_month desc
) t2
ORDER BY t1.Client_ID, t1.Year_Month DESC;
An alternate version (not using OUTER APPLY but using a LEFT JOIN and ROW_NUMBER() is as follows.
WITH Table1_DateRanges AS
(SELECT Client_ID, Year_Month, Product,
CAST(FORMAT(DATEADD(month, -7, LTRIM(STR(Year_Month)) + '01'), 'yyyyMM') AS int) AS Min_YM,
CAST(FORMAT(DATEADD(month, -1, LTRIM(STR(Year_Month)) + '01'), 'yyyyMM') AS int) AS Max_YM
FROM Table1
),
PossibleResults AS
(SELECT t1.Client_ID, t1.Year_Month, t1.Product, t2.Year_Month AS Year_Month_Table2, t2.Data,
ROW_NUMBER() OVER (PARTITION BY t1.Client_ID, t1.Year_Month ORDER BY t2.Year_Month DESC) AS rn
FROM Table1_DateRanges t1
LEFT OUTER JOIN Table2 t2
ON t1.Client_Id = t2.Client_Id
AND t1.Min_YM < t2.Year_Month
AND t1.Max_YM > t2.Year_Month
)
SELECT Client_Id, Year_Month, Product, Year_Month_Table2, Data
FROM PossibleResults
WHERE rn = 1
ORDER BY Client_ID, Year_Month DESC;
Both of these result in the following
Client_ID Year_Month Product Year_Month_Table2 Data
1 202001 anyproduct1 201911 whatever3
2 202001 anyproduct2 201907 whatever7
2 201905 anyproduct3 201903 whatever8
3 202004 anyproduct4 202001 whatever12
3 201905 anyproduct5 NULL NULL
3 201707 anyproduct6 201705 whatever16
Here is a db<>fiddle with data setup and both approaches.
Upvotes: 1
Reputation: 1269693
You can use apply
for this. I think you want:
select t1.*, t2.*
from table1 t1 outer apply
(select top (1) t2.*
from table2 t2
where t2.client_id = t1.client_id and
t2.year_month <= t1.year_month
order by t2.year_month desc
) t2;
This fetches the most recent table2
row for each table1
row, where the clients match.
Upvotes: 1