DNB

Reputation: 41

SQL: order by, then select first row with distinct value for multiple columns

As a simplified example, I need to select each instance where a customer had a shipping address that was different from their previous shipping address. So I have a large table with columns such as:

purchase_id | cust_id | date | address  | description
-----------------------------------------------------------
 1          | 5       | jan  | address1 | desc1
 2          | 6       | jan  | address2 | desc2
 3          | 5       | feb  | address1 | desc3
 4          | 6       | feb  | address2 | desc4
 5          | 5       | mar  | address3 | desc5
 6          | 5       | mar  | address3 | desc6
 7          | 5       | apr  | address1 | desc7
 8          | 6       | may  | address4 | desc8

Note that customers can "move back" to a previous address as customer 5 did in row 7.

What I want to select (as efficiently as possible, since this is a quite large table) is the first row out of every 'block' wherein a customer had subsequent orders shipped to the same address. In this example that would be rows 1, 2, 5, 7, and 8. In all the others, the customer has the same address as their previous order.

So effectively I want to first ORDER BY (cust_id, date), then SELECT purchase_id, cust_id, min(date), address, description.

However, I'm having trouble because SQL usually requires GROUP BY to be done before ORDER BY. I therefore can't figure out how to adapt e.g. either of the top answers to this question (which I otherwise quite like). It is necessary (conceptually, at least) to order by date before grouping or using aggregate functions like min(); otherwise I would miss instances like row 7 in my example table, where a customer 'moved back' to a previous address.

Note also that two customers can share an address, so I need to effectively group by both cust_id and address after ordering by date.

I'm using Snowflake, which I believe has most of the same commands available as recent versions of PostgreSQL and SQL Server (although I'm fairly new to Snowflake, so I'm not completely sure).
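To make the intent concrete, here is the selection logic sketched in plain Python (just pseudocode for the SQL I'm after, run over the sample rows above):

```python
# Keep each row whose address differs from that customer's previous
# order's address; within each customer the rows below are already in
# date order, so a per-customer "previous address" dict is enough.
rows = [
    (1, 5, "jan", "address1", "desc1"),
    (2, 6, "jan", "address2", "desc2"),
    (3, 5, "feb", "address1", "desc3"),
    (4, 6, "feb", "address2", "desc4"),
    (5, 5, "mar", "address3", "desc5"),
    (6, 5, "mar", "address3", "desc6"),
    (7, 5, "apr", "address1", "desc7"),
    (8, 6, "may", "address4", "desc8"),
]

prev = {}             # cust_id -> address of that customer's previous order
first_of_block = []
for purchase_id, cust_id, date, address, desc in rows:
    if prev.get(cust_id) != address:
        first_of_block.append(purchase_id)
    prev[cust_id] = address

print(first_of_block)  # [1, 2, 5, 7, 8]
```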

Upvotes: 4

Views: 4395

Answers (5)

Simeon Pilgrim

Reputation: 26120

Yet more late options/opinions:

Given this is an edge-detection problem, LAG/LEAD (depending on which edge you are looking for) is the simplest tool.

Marcin's LAG option can be moved from a sub-select up to the top level with QUALIFY.

  • Where NOT and EQUAL_NULL add value: the first LAG in each partition returns NULL, and a plain inequality comparison against NULL yields NULL, so those first rows would be silently filtered out. With the null-safe EQUAL_NULL, the pair compares as not equal, which NOT flips to TRUE, so the first row per customer is kept.
SELECT * 
FROM data_table 
QUALIFY not equal_null(address, lag(address) over(partition by cust_id order by purchase_id))
ORDER BY 1

giving:

 PURCHASE_ID | CUST_ID | DATE       | ADDRESS  | DESCRIPTION
-------------+---------+------------+----------+-------------
 1           | 5       | 2021-01-01 | address1 | desc1
 2           | 6       | 2021-01-01 | address2 | desc2
 5           | 5       | 2021-03-01 | address3 | desc5
 7           | 5       | 2021-04-01 | address1 | desc7
 8           | 6       | 2021-05-01 | address4 | desc8
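The NULL edge case can be modeled in a couple of lines of Python (a sketch of the comparison semantics, not Snowflake itself):

```python
# On the first row per customer, LAG returns NULL (None here). A plain
# SQL <> against NULL yields NULL, so the row would be filtered out;
# EQUAL_NULL is null-safe (NULL = NULL is TRUE) and never returns NULL.
def equal_null(a, b):
    # Python's == already treats None == None as True, matching EQUAL_NULL
    return a == b

# First row per customer: address present, previous address is None,
# so NOT EQUAL_NULL(...) flags it as a change and the row survives.
keep_first_row = not equal_null("address1", None)
print(keep_first_row)  # True
```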

Lukasz's CONDITIONAL_CHANGE_EVENT is a very nice solution, but CONDITIONAL_CHANGE_EVENT is not just finding a change edge, it is enumerating the changes. So if you were looking for, say, the 5th change, CONDITIONAL_CHANGE_EVENT saves you having to chain a LAG/LEAD with a ROW_NUMBER(). As a consequence of that implicit numbering, though, you cannot collapse that solution into a single block like:

 ROW_NUMBER() OVER(PARTITION BY CUST_ID, CONDITIONAL_CHANGE_EVENT(address) OVER (PARTITION BY CUST_ID ORDER BY DATE) ORDER BY DATE) = 1

because the implicit row_number inside CONDITIONAL_CHANGE_EVENT generates the error:

Window function x may not be nested inside another window function.

Upvotes: 0

Lukasz Szozda

Reputation: 176224

Snowflake has introduced CONDITIONAL_CHANGE_EVENT, which neatly solves the described case:

Returns a window event number for each row within a window partition when the value of the argument expr1 in the current row is different from the value of expr1 in the previous row. The window event number starts from 0 and is incremented by 1 to indicate the number of changes so far within that window
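The semantics quoted above can be modeled in a few lines of Python (a sketch of the documented behavior, not Snowflake code):

```python
# Model CONDITIONAL_CHANGE_EVENT(expr) over one window partition:
# the counter starts at 0 and increments each time the value differs
# from the previous row's value.
def conditional_change_event(values):
    out, count = [], 0
    for i, v in enumerate(values):
        if i > 0 and v != values[i - 1]:
            count += 1
        out.append(count)
    return out

# Customer 5's addresses in date order (purchases 1, 3, 5, 6, 7):
print(conditional_change_event(
    ["address1", "address1", "address3", "address3", "address1"]))
# [0, 0, 1, 1, 2]
```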


Data preparation:

CREATE OR REPLACE TABLE t(purchase_id INT, cust_id INT,
                          date DATE, address TEXT, description TEXT);

INSERT INTO t(purchase_id, cust_id, date, address, description)
VALUES 
 ( 1, 5, '2021-01-01'::DATE ,'address1','desc1')
,( 2, 6, '2021-01-01'::DATE ,'address2','desc2')
,( 3, 5, '2021-02-01'::DATE ,'address1','desc3')
,( 4, 6, '2021-02-01'::DATE ,'address2','desc4')
,( 5, 5, '2021-03-01'::DATE ,'address3','desc5')
,( 6, 5, '2021-03-01'::DATE ,'address3','desc6')
,( 7, 5, '2021-04-01'::DATE ,'address1','desc7')
,( 8, 6, '2021-05-01'::DATE ,'address4','desc8');

Query:

SELECT *, 
 CONDITIONAL_CHANGE_EVENT(address) OVER (PARTITION BY CUST_ID ORDER BY DATE) AS CCE
FROM t
ORDER BY purchase_id;

 PURCHASE_ID | CUST_ID | DATE       | ADDRESS  | DESCRIPTION | CCE
-------------+---------+------------+----------+-------------+-----
 1           | 5       | 2021-01-01 | address1 | desc1       | 0
 2           | 6       | 2021-01-01 | address2 | desc2       | 0
 3           | 5       | 2021-02-01 | address1 | desc3       | 0
 4           | 6       | 2021-02-01 | address2 | desc4       | 0
 5           | 5       | 2021-03-01 | address3 | desc5       | 1
 6           | 5       | 2021-03-01 | address3 | desc6       | 1
 7           | 5       | 2021-04-01 | address1 | desc7       | 2
 8           | 6       | 2021-05-01 | address4 | desc8       | 1

Once the subgroup column CCE is identified, QUALIFY can be used to find the first row per (CUST_ID, CCE) pair.

Full query:

WITH cte AS (
 SELECT *,
  CONDITIONAL_CHANGE_EVENT(address) OVER (PARTITION BY CUST_ID ORDER BY DATE) AS CCE
 FROM t
)
SELECT *
FROM  cte
QUALIFY ROW_NUMBER() OVER(PARTITION BY CUST_ID, CCE ORDER BY DATE) = 1
ORDER BY purchase_id;

Output:

 PURCHASE_ID | CUST_ID | DATE       | ADDRESS  | DESCRIPTION | CCE
-------------+---------+------------+----------+-------------+-----
 1           | 5       | 2021-01-01 | address1 | desc1       | 0
 2           | 6       | 2021-01-01 | address2 | desc2       | 0
 5           | 5       | 2021-03-01 | address3 | desc5       | 1
 7           | 5       | 2021-04-01 | address1 | desc7       | 2
 8           | 6       | 2021-05-01 | address4 | desc8       | 1

Upvotes: 1

Marcin Zukowski

Reputation: 4739

Sorry for a late reply. I meant to react to this post a few days ago.

The "most proper" way I can think of is to use the LAG function.

Take this:

select purchase_id, cust_id, address, 
lag(address, 1) over (partition by cust_id order by purchase_id) prev_address 
from x order by cust_id, purchase_id;
-------------+---------+----------+--------------+
 PURCHASE_ID | CUST_ID | ADDRESS  | PREV_ADDRESS |
-------------+---------+----------+--------------+
 1           | 5       | address1 | [NULL]       |
 3           | 5       | address1 | address1     |
 5           | 5       | address3 | address1     |
 6           | 5       | address3 | address3     |
 7           | 5       | address1 | address3     |
 2           | 6       | address2 | [NULL]       |
 4           | 6       | address2 | address2     |
 8           | 6       | address4 | address2     |
-------------+---------+----------+--------------+

And then you can easily detect the rows with the changes you described:

select purchase_id, cust_id, address, prev_address from (
  select purchase_id, cust_id, address, 
  lag(address, 1) over (partition by cust_id order by purchase_id) prev_address 
  from x 
) sub 
where not equal_null(address, prev_address)
order by cust_id, purchase_id;
-------------+---------+----------+--------------+
 PURCHASE_ID | CUST_ID | ADDRESS  | PREV_ADDRESS |
-------------+---------+----------+--------------+
 1           | 5       | address1 | [NULL]       |
 5           | 5       | address3 | address1     |
 7           | 5       | address1 | address3     |
 2           | 6       | address2 | [NULL]       |
 8           | 6       | address4 | address2     |
-------------+---------+----------+--------------+

Note that I'm using the EQUAL_NULL function to get NULL=NULL semantics.
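The same LAG pattern can be exercised locally through Python's sqlite3 module (a sketch, not Snowflake: SQLite 3.25+ supports window functions, and its null-safe IS NOT operator plays the role of NOT EQUAL_NULL here):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE x(purchase_id INT, cust_id INT, date TEXT,
               address TEXT, description TEXT);
INSERT INTO x VALUES
 (1, 5, '2021-01-01', 'address1', 'desc1'),
 (2, 6, '2021-01-01', 'address2', 'desc2'),
 (3, 5, '2021-02-01', 'address1', 'desc3'),
 (4, 6, '2021-02-01', 'address2', 'desc4'),
 (5, 5, '2021-03-01', 'address3', 'desc5'),
 (6, 5, '2021-03-01', 'address3', 'desc6'),
 (7, 5, '2021-04-01', 'address1', 'desc7'),
 (8, 6, '2021-05-01', 'address4', 'desc8');
""")

# IS NOT is SQLite's null-safe inequality, so the first row per customer
# (whose LAG value is NULL) is correctly flagged as a change.
rows = con.execute("""
SELECT purchase_id FROM (
  SELECT purchase_id, address,
         LAG(address) OVER (PARTITION BY cust_id
                            ORDER BY purchase_id) AS prev_address
  FROM x
)
WHERE address IS NOT prev_address
ORDER BY purchase_id
""").fetchall()

print([r[0] for r in rows])  # [1, 2, 5, 7, 8]
```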

Note that the LAG function can be computationally intensive, though comparable with the ROW_NUMBER approach proposed earlier.

Upvotes: 2

Giorgi Nakeuri

Reputation: 35790

You can use the row_number window function to do the trick. Note, though, that partitioning by (cust_id, address) spans a customer's whole history, so a 'move back' to an earlier address (like row 7 in the example) would not be picked up:

;with cte as(select *, row_number() over(partition by cust_id, address
                                         order by purchase_id) as rn from purchases)
select * from cte 
where rn = 1

Upvotes: 1

Anthony E

Reputation: 11245

This would probably be best solved by a subquery to get the first purchase for each user, then using IN to filter rows based on that result.

To clarify, purchase_id is an autoincrement column, correct? If so, a purchase with a higher purchase_id must have been created at a later date, and the following should suffice:

SELECT *
FROM purchases
WHERE purchase_id IN (
  SELECT MIN(purchase_id) AS first_purchase_id
  FROM purchases
  GROUP BY cust_id
)

If you only want the first purchase for customers with more than one address, add a HAVING clause to your subquery:

SELECT *
FROM purchases
WHERE purchase_id IN (
  SELECT MIN(purchase_id) AS first_purchase_id
  FROM purchases
  GROUP BY cust_id
  HAVING COUNT(DISTINCT address) > 1
)

Fiddle: http://sqlfiddle.com/#!9/12d75/6

However, if purchase_id is NOT an autoincrement column, then select both cust_id and MIN(date) in your subquery and INNER JOIN on cust_id and the minimum date:

SELECT *
FROM purchases
INNER JOIN (
  SELECT cust_id, MIN(date) AS min_date
  FROM purchases
  GROUP BY cust_id
  HAVING COUNT(DISTINCT address) > 1
) cust_purchase_date
ON purchases.cust_id = cust_purchase_date.cust_id AND purchases.date = cust_purchase_date.min_date

The first query example will probably be faster, however, so use that one if your purchase_id is an autoincrement column.

Upvotes: 0
