jimiclapton
jimiclapton

Reputation: 889

Use of window function and subquery in Hive

I have a table in which one row represents an order. I'm trying to write a query that returns ALL customer orders for the year 2017, subsequent to where the 2nd order was placed in Jan 2017.

Initial code looks like this:

SELECT
order_date
,cust_id 
,nth_booking
,total_bookings
FROM (SELECT order_date
,order_id
,COUNT (*) OVER (PARTITION BY cust_id ORDER BY order_date) AS nth_booking
,COUNT (*) OVER (PARTITION BY cust_id) AS total_bookings
FROM my.orders
WHERE order_date BETWEEN '2017-01-01' AND '2017-01-31') t1

This gives the following output, so far so good:

-------------------------------------------------------
| order_date | cust_id | nth_booking | total_bookings |
-------------------------------------------------------
| 2017-01-01 |   123   |       1     |       4        |
| 2017-01-02 |   123   |       2     |       4        |
| 2017-01-05 |   123   |       3     |       4        |
| 2017-09-27 |   123   |       4     |       4        |
| 2017-02-02 |   456   |       1     |       3        |
| 2017-11-16 |   456   |       2     |       3        |
| 2017-12-04 |   456   |       3     |       3        |
| 2017-01-17 |   678   |       1     |       5        |
| 2017-01-30 |   678   |       2     |       5        |
| 2017-02-31 |   678   |       3     |       5        |
| 2017-05-26 |   678   |       4     |       5        |
| 2017-09-18 |   678   |       5     |       5        |

However, since I want to only retrieve order details subsequent to the 2nd order which has to have occurred in Jan 2017, I added some additional conditions so that the query is now as below:

    SELECT
    order_date
    ,cust_id 
    ,nth_booking
    ,total_bookings
    FROM (SELECT order_date
    ,order_id
    ,COUNT (*) OVER (PARTITION BY cust_id ORDER BY order_date) AS nth_booking
    ,COUNT (*) OVER (PARTITION BY cust_id) AS total_bookings
    FROM my.orders
    WHERE order_date BETWEEN '2017-01-01' AND '2017-01-31') t1
  WHERE 
  nth_booking >= 2
  AND order_date BETWEEN '2017-01-01' AND '2017-01-31'

This is clearly incorrect and I can of course see why when looking at the results below wherein the order_date condition is being met as declared:

-------------------------------------------------------
| order_date | cust_id | nth_booking | total_bookings |
-------------------------------------------------------
| 2017-01-02 |   123   |       2     |       4        |
| 2017-01-05 |   123   |       3     |       4        |
| 2017-01-30 |   678   |       2     |       5        |

What I want, however, is more akin to this, where the 2nd order was placed in Jan 2017, yet I'm displaying all subsequent orders.

  -------------------------------------------------------
| order_date | cust_id | nth_booking | total_bookings |
-------------------------------------------------------
| 2017-01-01 |   123   |       2     |       4        |
| 2017-03-05 |   123   |       3     |       4        |
| 2017-09-27 |   123   |       4     |       4        |
| 2017-01-30 |   678   |       2     |       5        |
| 2017-02-31 |   678   |       3     |       5        |
| 2017-05-26 |   678   |       4     |       5        |
| 2017-09-18 |   678   |       5     |       5        |

How do I get to this view?

I would appreciate any guidance offered and hope I have provided sufficiently reproducible detail of my methodology and workings.

Thanks in advance

Upvotes: 0

Views: 1772

Answers (1)

leftjoin
leftjoin

Reputation: 38325

Calculate second_order_jan flag for cust_id and use it for filtering:

select
      order_date
     ,cust_id 
     ,nth_booking
     ,total_bookings 
from
( --calculate second_order_jan flag for the cust_id
select cust_id,
       order_date,
       order_id,
       nth_booking,
       total_bookings,
       max(case when month(order_date) = 1 and nth_booking=2 then 1 end) over (partition by cust_id) second_order_jan_flag  
from 
(
SELECT cust_id,
     order_date
    ,order_id
    ,COUNT (*) OVER (PARTITION BY cust_id ORDER BY order_date) AS nth_booking
    ,COUNT (*) OVER (PARTITION BY cust_id) AS total_bookings
    FROM my.orders
    WHERE order_date BETWEEN '2017-01-01' AND '2017-01-31'
) t1
) t2 where second_order_jan_flag =1
       and nth_booking >= 2 --Filter only orders after second.

Upvotes: 1

Related Questions