lucy
lucy

Reputation: 4506

Get the last state of record if it follow correct sequence

In a data table, data is stored based on the production state and shipment of a product. This means product states are

product_created
packed
shipped
delivered

The packed and shipped states are inserted from a legacy system where these states can come late, even after the box has been delivered status. I want to take the last state of the product if the product journey follows the correct sequence (product_created -> packed -> shipped -> delivered)

It should also support duplicate records, for example, product_created -> packed -> packed -> packed -> shipped -> shipped -> delivered. In this case, the sequence is correct, so it should be taken into consideration.

input table

PRODUCT_ID  STATE       INSERTION_TIME
1   product_created 2023-01-10 07:00:00
1   product_created 2023-01-10 09:00:00
1   packed          2023-01-11 01:00:00
1   packed          2023-01-11 02:00:00
1   packed          2023-01-11 09:00:00
1   shipped         2023-01-12 01:00:00
1   delivered       2023-01-12 02:00:00
2   product_created 2023-01-10 07:00:44
2   packed          2023-01-11 01:00:00
2   delivered       2023-01-11 09:00:00
2   shipped         2023-01-12 02:00:00
3   product_created 2023-01-10 07:00:00
3   packed          2023-01-11 01:00:00
3   product_created 2023-01-11 09:00:00
3   packed          2023-01-11 09:00:00
3   shipped         2023-01-12 01:00:00
3   delivered       2023-01-12 02:00:00

Output

PRODUCT_ID  STATE   INSERTION_TIME
1         delivered 2023-01-12 02:00:00

Above result does not content PRODUCT_ID =2 and 3 because it contains incorrect sequence.

My query gets the final state of a product, but I don't know how to exclude products with incorrect sequences

SELECT * FROM datatable
QUALIFY ROW_NUMBER() OVER ( PARTITION BY PRODUCT_ID ORDER BY INSERTION_TIME DESC) = 1

This query only gives the last state of each product but it may contains some product which have incorrect sequence.

Upvotes: 1

Views: 85

Answers (2)

Rajat
Rajat

Reputation: 5803

Assuming there can only be 4 distinct patterns for correct sequences, we can start by taking care of the duplicated states by keeping the most recent entry per product_id and state.

In the final select we can pick the last state per product_id on the condition that all its states when aggregated and ordered by insertion_time match one of the valid sequences.

with cte as
(select *
 from dataTable
 qualify row_number() over (partition by product_id, state order by insertion_time desc)=1)

select * 
from cte
qualify row_number() over (partition by product_id order by insertion_time desc) = 1 and
        listagg(state,',') within group (order by insertion_time asc) over (partition by product_id) in 
            ('product_created,packed,shipped,delivered',
             'product_created,packed,shipped',
             'product_created,packed',
             'product_created')

Upvotes: 0

Lukasz Szozda
Lukasz Szozda

Reputation: 175556

SQL allows to detect patterns across multiple rows using MATCH_RECOGNIZE clause.

Recognizes matches of a pattern in a set of rows. MATCH_RECOGNIZE accepts a set of rows (from a table, view, subquery, or other source) as input, and returns all matches for a given row pattern within this set. The pattern is defined similarly to a regular expression.

MATCH_RECOGNIZE is typically used to detect events in time series.

Here searching for pattern of PATTERN (^ product_created+ packed+ shipped+ delivered+ $)

  • ^ start of partition
  • + indicates 1 or more occurrences
  • * indicates 0 or more occurrences(if needed)
  • $ end of partition

All rows per group:

SELECT *
FROM dataTable
MATCH_RECOGNIZE (
   PARTITION BY PRODUCT_ID
   ORDER BY INSERTION_TIME
   MEASURES
    classifier() as cl
   ALL ROWS PER MATCH
   PATTERN  (^ product_created+  packed+  shipped+  delivered+ $)  
   DEFINE 
       product_created AS STATE = 'product_created'
      ,packed AS STATE = 'packed'
      ,shipped AS STATE = 'shipped'
      ,delivered AS STATE = 'delivered'
) AS mr; 

Single row per group:

SELECT *
FROM dataTable
MATCH_RECOGNIZE (
   PARTITION BY PRODUCT_ID
   ORDER BY INSERTION_TIME
   MEASURES
    classifier() as cl,
    LAST(INSERTION_TIME) AS insertion_time
   PATTERN  (^ product_created+  packed+  shipped+  delivered+ $)  
   DEFINE 
       product_created AS STATE = 'product_created'
      ,packed AS STATE = 'packed'
      ,shipped AS STATE = 'shipped'
      ,delivered AS STATE = 'delivered'
) AS mr;

Output:

enter image description here

MATCH_RECOGNIZE db<>fiddle demo

Upvotes: 1

Related Questions