NickBraunagel
NickBraunagel

Reputation: 1599

Presto - most current record before date

I have table A, containing employee IDs and sequential date steps since their employment start date (in this case 1x employee who started on 1/1/1900).

employee_id     date
1xxxxxxx      1/1/1900
1xxxxxxx      4/1/1900
1xxxxxxx      7/1/1900
1xxxxxxx      10/1/1900
1xxxxxxx      1/1/1901
1xxxxxxx      4/1/1901
1xxxxxxx      7/1/1901
1xxxxxxx      10/1/1901

I also have table B, containing all ratings an employee ever received:

employee_id  rating       date
1xxxxxxx       2        1/1/1900
1xxxxxxx       3        6/1/1900
1xxxxxxx       4        12/31/1900
1xxxxxxx       4        8/2/1901
1xxxxxxx       5        10/1/1901

How would I join B to A to obtain the most current rating for the employee as of date A.date? This is the desired output:

employee_id   date      rating
1xxxxxxx    1/1/1900        2
1xxxxxxx    4/1/1900        2
1xxxxxxx    7/1/1900        3
1xxxxxxx    10/1/1900       3
1xxxxxxx    1/1/1901        4
1xxxxxxx    4/1/1901        4
1xxxxxxx    7/1/1901        4
1xxxxxxx    10/1/1901       5

Upvotes: 1

Views: 2411

Answers (2)

Piotr Findeisen
Piotr Findeisen

Reputation: 20730

(This is supplemental to @GordonLinoff answer)

Subquery decorrelation in Presto currently is able to to decorrelate a predicate when it's "near the top". It is not currently capable of decorrelating ORDER BY or TopN (ORDER BY .. LIMIT) (issue). This can be worked around by using max_by instead. This turns a subquery into aggregation query which Presto can decorrelate.

WITH a AS ( SELECT 1 employee_id, 2 date ),
     b AS ( SELECT * FROM (VALUES (1, 2, 2), (1, 1, 1), (1, 3, 3)) t(employee_id, date, rating) )

select a.*,
       (select min_by(b.rating, b.date)
        from b
        where b.employee_id = a.employee_id and
              b.date <= a.date
       ) as rating
from a;
 employee_id | date | rating
-------------+------+--------
           1 |    2 |      1
(1 row)

Upvotes: 3

Gordon Linoff
Gordon Linoff

Reputation: 1269883

A simple way to express this uses a correlated subquery:

select a.*,
       (select b.rating
        from b
        where b.employee_id = a.employee_id and
              b.date <= a.date
        order by b.date desc
        limit 1
       ) as rating
from a;

If you need more than one column, then a more complicated query is needed. One approach is:

select ab.*
from (select a.*, b.*,  -- list the columns you want
             row_number() over (partition by a.employee_id, a.date order by b.date desc) as seqnum
      from a left join
           b
           on b.employee_id = a.employee_id and
              b.date <= a.date
     ) ab
where seqnum = 1;

Upvotes: 1

Related Questions