Reputation: 41
As a simplified example, I need to select each instance where a customer had a shipping address that was different from their previous shipping address. So I have a large table with columns such as:
purchase_id | cust_id | date | address  | description
------------+---------+------+----------+-------------
1           | 5       | jan  | address1 | desc1
2           | 6       | jan  | address2 | desc2
3           | 5       | feb  | address1 | desc3
4           | 6       | feb  | address2 | desc4
5           | 5       | mar  | address3 | desc5
6           | 5       | mar  | address3 | desc6
7           | 5       | apr  | address1 | desc7
8           | 6       | may  | address4 | desc8
Note that customers can "move back" to a previous address as customer 5 did in row 7.
What I want to select (as efficiently as possible, since this is quite a large table) is the first row of every 'block' in which a customer had subsequent orders shipped to the same address. In this example that would be rows 1, 2, 5, 7, and 8. In all the others, the customer has the same address as their previous order.
So effectively I want to first ORDER BY (cust_id, date), then SELECT purchase_id, cust_id, min(date), address, description. However, I'm having trouble because SQL usually requires GROUP BY to be done before ORDER BY. I therefore can't figure out how to adapt e.g. either of the top answers to this question (which I otherwise quite like). It is necessary (conceptually, at least) to order by date before grouping or using aggregate functions like min(), otherwise I would miss instances like row 7 in my example table, where a customer 'moved back' to a previous address.
Note also that two customers can share an address, so I need to effectively group by both cust_id and address after ordering by date.
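Conceptually, the naive approach I keep coming back to is just grouping by (cust_id, address), roughly like the sketch below (I'm assuming the table is called purchases); but that produces only one row per customer/address pair, so it merges customer 5's return to address1 in row 7 back into the same group as row 1 instead of starting a new block:
-- naive sketch (table name "purchases" is assumed): one row per (cust_id, address),
-- which wrongly merges customer 5's return to address1 (row 7) with row 1
SELECT cust_id, address, MIN(date) AS first_date
FROM purchases
GROUP BY cust_id, address
ORDER BY cust_id, first_date;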
I'm using Snowflake, which I believe has most of the same commands available as recent versions of PostgreSQL and SQL Server (although I'm fairly new to Snowflake, so I'm not completely sure).
Upvotes: 4
Views: 4395
Reputation: 26120
Yet more late options/opinions:
Given this is an edge-detection problem, LAG/LEAD (depending on which edge you are looking for) is the simplest tool.
Marcin's LAG option can be moved from a sub-select to a first level option, with QUALIFY.
SELECT *
FROM data_table
QUALIFY not equal_null(address, lag(address) over(partition by cust_id order by purchase_id))
ORDER BY 1
giving:
PURCHASE_ID | CUST_ID | DATE       | ADDRESS  | DESCRIPTION
------------+---------+------------+----------+-------------
1           | 5       | 2021-01-01 | address1 | desc1
2           | 6       | 2021-01-01 | address2 | desc2
5           | 5       | 2021-03-01 | address3 | desc5
7           | 5       | 2021-04-01 | address1 | desc7
8           | 6       | 2021-05-01 | address4 | desc8
Lukasz's CONDITIONAL_CHANGE_EVENT is a very nice solution, but CONDITIONAL_CHANGE_EVENT is not just finding a change edge, it is enumerating the changes. So if you were looking for, say, the 5th change, CONDITIONAL_CHANGE_EVENT saves you having to chain a LAG/LEAD with a ROW_NUMBER(). However, you cannot collapse that solution into a single block like:
ROW_NUMBER() OVER(PARTITION BY CUST_ID, CONDITIONAL_CHANGE_EVENT(address) OVER (PARTITION BY CUST_ID ORDER BY DATE) ORDER BY DATE) = 1
because the implicit row_number inside CONDITIONAL_CHANGE_EVENT generates the error:
Window function x may not be nested inside another window function.
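To illustrate the enumeration point, here is a rough sketch (not part of the original answer; it reuses the table t and the CCE alias from Lukasz's answer below, and picks an arbitrary change number). Because the counter enumerates the changes, jumping to e.g. the block that starts at a customer's 2nd address change is just a filter on that counter:
-- sketch only: first purchase of the block that starts at each customer's
-- 2nd address change (the CONDITIONAL_CHANGE_EVENT counter starts at 0)
WITH cte AS (
    SELECT *,
           CONDITIONAL_CHANGE_EVENT(address)
               OVER (PARTITION BY cust_id ORDER BY date) AS cce
    FROM t
)
SELECT *
FROM cte
WHERE cce = 2
QUALIFY ROW_NUMBER() OVER (PARTITION BY cust_id, cce ORDER BY date) = 1
ORDER BY purchase_id;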
Upvotes: 0
Reputation: 176224
Snowflake has introduced CONDITIONAL_CHANGE_EVENT, which ideally solves the described case:
Returns a window event number for each row within a window partition when the value of the argument expr1 in the current row is different from the value of expr1 in the previous row. The window event number starts from 0 and is incremented by 1 to indicate the number of changes so far within that window
Data preparation:
CREATE OR REPLACE TABLE t(purchase_id INT, cust_id INT,
date DATE, address TEXT, description TEXT);
INSERT INTO t(purchase_id, cust_id, date, address, description)
VALUES
( 1, 5, '2021-01-01'::DATE ,'address1','desc1')
,( 2, 6, '2021-01-01'::DATE ,'address2','desc2')
,( 3, 5, '2021-02-01'::DATE ,'address1','desc3')
,( 4, 6, '2021-02-01'::DATE ,'address2','desc4')
,( 5, 5, '2021-03-01'::DATE ,'address3','desc5')
,( 6, 5, '2021-03-01'::DATE ,'address3','desc6')
,( 7, 5, '2021-04-01'::DATE ,'address1','desc7')
,( 8, 6, '2021-05-01'::DATE ,'address4','desc8');
Query:
SELECT *,
CONDITIONAL_CHANGE_EVENT(address) OVER (PARTITION BY CUST_ID ORDER BY DATE) AS CCE
FROM t
ORDER BY purchase_id;
Once the subgroup column CCE is identified, QUALIFY could be used to find the first row per (CUST_ID, CCE).
Full query:
WITH cte AS (
SELECT *,
CONDITIONAL_CHANGE_EVENT(address) OVER (PARTITION BY CUST_ID ORDER BY DATE) AS CCE
FROM t
)
SELECT *
FROM cte
QUALIFY ROW_NUMBER() OVER(PARTITION BY CUST_ID, CCE ORDER BY DATE) = 1
ORDER BY purchase_id;
Output:
Upvotes: 1
Reputation: 4739
Sorry for a late reply. I meant to react to this post a few days ago.
The "most proper" way I can think of is to use the LAG function.
Take this:
select purchase_id, cust_id, address,
lag(address, 1) over (partition by cust_id order by purchase_id) prev_address
from x order by cust_id, purchase_id;
-------------+---------+----------+--------------+
PURCHASE_ID | CUST_ID | ADDRESS | PREV_ADDRESS |
-------------+---------+----------+--------------+
1 | 5 | address1 | [NULL] |
3 | 5 | address1 | address1 |
5 | 5 | address3 | address1 |
6 | 5 | address3 | address3 |
7 | 5 | address1 | address3 |
2 | 6 | address2 | [NULL] |
4 | 6 | address2 | address2 |
8 | 6 | address4 | address2 |
-------------+---------+----------+--------------+
And then you can easily detect the rows with the events you described:
select purchase_id, cust_id, address, prev_address from (
select purchase_id, cust_id, address,
lag(address, 1) over (partition by cust_id order by purchase_id) prev_address
from x
) sub
where not equal_null(address, prev_address)
order by cust_id, purchase_id;
-------------+---------+----------+--------------+
PURCHASE_ID | CUST_ID | ADDRESS | PREV_ADDRESS |
-------------+---------+----------+--------------+
1 | 5 | address1 | [NULL] |
5 | 5 | address3 | address1 |
7 | 5 | address1 | address3 |
2 | 6 | address2 | [NULL] |
8 | 6 | address4 | address2 |
-------------+---------+----------+--------------+
Note that I'm using the EQUAL_NULL function to get NULL=NULL semantics.
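For illustration (a minimal sketch, not part of the original answer): EQUAL_NULL is a NULL-safe comparison, so the first row per customer, whose prev_address is NULL, still passes the NOT EQUAL_NULL(...) filter above, whereas a plain inequality would silently drop it:
-- minimal sketch of the NULL handling: EQUAL_NULL treats NULL = NULL as TRUE,
-- while the plain '=' operator returns NULL
SELECT EQUAL_NULL(NULL, NULL)    AS equal_null_result,    -- TRUE
       NULL = NULL               AS plain_equals_result,  -- NULL
       NOT EQUAL_NULL('a', NULL) AS first_row_passes;     -- TRUE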
Note that the LAG function can be computationally intensive, though (but comparable with the ROW_NUMBER approach proposed earlier).
Upvotes: 2
Reputation: 35790
You can use the row_number window function to do the trick:
;with cte as (
    select *,
           row_number() over(partition by cust_id, address
                             order by purchase_id) as rn
    from table
)
select * from cte
where rn = 1
Upvotes: 1
Reputation: 11245
This would probably be best solved by a subquery to get the first purchase for each user, then using IN to filter rows based on that result.
To clarify, purchase_id is an autoincrement column, correct? If so, a purchase with a higher purchase_id must have been created at a later date, and the following should suffice:
SELECT *
FROM purchases
WHERE purchase_id IN (
SELECT MIN(purchase_id) AS first_purchase_id
FROM purchases
GROUP BY cust_id
)
If you only want the first purchase for customers with more than one address, add a HAVING clause to your subquery:
SELECT *
FROM purchases
WHERE purchase_id IN (
SELECT MIN(purchase_id) AS first_purchase_id
FROM purchases
GROUP BY cust_id
HAVING COUNT(DISTINCT address) > 1
)
Fiddle: http://sqlfiddle.com/#!9/12d75/6
However, if purchase_id is NOT an autoincrement column, then SELECT both cust_id and min(date) in your subquery and use an INNER JOIN on cust_id and min(date):
SELECT *
FROM purchases
INNER JOIN (
SELECT cust_id, MIN(date) AS min_date
FROM purchases
GROUP BY cust_id
HAVING COUNT(DISTINCT address) > 1
) cust_purchase_date
ON purchases.cust_id = cust_purchase_date.cust_id AND purchases.date = cust_purchase_date.min_date
The first query example will probably be faster, however, so use that if your purchase_id is an autoincrement column.
Upvotes: 0