Nick de Silva
Nick de Silva

Reputation: 83

Athena SQL - Self-join subquery using outer column

So I have catalogued data in an S3 bucket which very similarly resembles this data here:

+-----+-----------+---------+-----------------------+
| id  | title     | event   | time                  |
+-----+-----------+---------+-----------------------+
|1    | book A    | BORROW  | 2018-07-01 09:00:00   |
|1    | book A    | RETURN  | 2018-08-01 09:00:00   |
|2    | book B    | BORROW  | 2018-08-01 13:00:00   |
|2    | book B    | RETURN  | 2018-10-01 17:00:00   |
|1    | book A    | BORROW  | 2018-11-01 09:00:00   |
|1    | book A    | RETURN  | 2018-12-01 09:00:00   |
+-----+-----------+---------------------------------+

I basically want to be able to write a SELECT statement in Amazon Athena which displays the borrow and return times next to one another in a row like this:

+-----+-----------+-----------------------+-----------------------+
| id  | title     | borrow_time           | return_time           |
+-----+-----------+-----------------------+-----------------------+
|1    | book A    | 2018-07-01 09:00:00   | 2018-08-01 09:00:00   |
|2    | book B    | 2018-08-01 13:00:00   | 2018-10-01 17:00:00   |
|1    | book A    | 2018-11-01 09:00:00   | 2018-12-01 09:00:00   |
+-----+-----------+-----------------------+-----------------------+

I've been spending heaps of time writing around 5 different queries (using things like OUTER APPLY, but Athena seems to be really sensitive to work with, especially considering that it doesn't have any functionality with OUTER APPLY. This is the logic for my latest statement:

SELECT b.id,
       b.title,
       b.time AS borrow_time,
       MIN(r.time) AS return_time
FROM (
      SELECT id,
             title,
             time
      FROM books
      WHERE event = 'BORROW'
     ) b
OUTER JOIN (
            SELECT id,
                   time
            FROM books
            WHERE event = 'RETURN'
           ) r
        ON b.id = r.id
       AND b.time < r.time
GROUP BY b.id,
         b.title,
         borrow_time
ORDER BY borrow_time;

Any ideas to get around this would be greatly appreciated!

Upvotes: 2

Views: 7186

Answers (2)

Fahmi
Fahmi

Reputation: 37473

Try with CASE WHEN and row_number() function :

with pcte as 
(
 SELECT id,
           title,event,time, row_number() over(order by id,title,event) as rn
           FROM books
)
    SELECT id,
           title,
           case when event = 'BORROW' then b.time end AS borrow_time,
           case when event = 'RETURN' then b.time end AS return_time
     FROM pcte order by id, title, rn

Upvotes: 0

Gordon Linoff
Gordon Linoff

Reputation: 1269503

Assuming the borrows and returns are all paired, you can enumerate them and then use conditional aggregation:

select id, title,
       max(case when event = 'BORROW' then b.time end) as borrow_time,
       max(case when event = 'RETURN' then b.time end) as return_time
from (select b.*,
             row_number() over (partition by b.id, b.event order by b.time) as sequm
      from books b
     ) b
group by id, title, seqnum
order by id, title, seqnum;

Upvotes: 1

Related Questions