Kalerion
Kalerion

Reputation: 91

Find preceding row / entry under specific conditions in SQL / Redshift

I am trying to find the preceding row to a specific occurrence in a database, or rather some data from it.

In this example I would like to find the movement_method of the preceding row (sorted by timestamp) before a user visited the pub. So in tom's example I would like to know that tom went home by car before visiting the pub. (it doesn't matter how he traveled to the pub but rather the used method before going to the pub)

I have an example database with: user, location, movement_method, timestamp:

user location movement_method timestamp
tom work car 2022-03-02 14:30
tom home car 2022-03-02 20:30
tom pub bus 2022-03-02 22:30
tom home foot 2022-03-03 02:30
jane school bus 2022-03-02 08:30
jane home bus 2022-03-02 14:30
jane pub foot 2022-03-02 21:30
jane home bus 2022-03-02 23:30
lila work bus 2022-03-02 08:30
lila home bus 2022-03-02 16:30
jake friend car 2022-03-02 15:30
jake home bus 2022-03-02 20:30
jake pub car 2022-03-02 20:30
jake home car 2022-03-03 02:30

For this database the result I would want would be: | user | preceding_movement_method | | ---- | ------- | | tom | car | | jane | bus | | jake | bus |

My current approach is to have a partition or window function for "preceding_movement_method" but I'm stuck finding the "preceding" entry before the one which fits the where statement.

So I'm looking for something like this pseudocode:

select user,
 (select preceding movement_method 
  from movement_database 
  where location = 'pub'
  order by timestamp) as preceding_movement_method
from movement_database

Upvotes: 0

Views: 514

Answers (2)

Bill Weiner
Bill Weiner

Reputation: 11032

The LAG() window function is where I would go with this. I set the (sqlfiddle) data up as:

create table movements (
  uname varchar(16),
  location  varchar(16),
  movement_method   varchar(16),
  ts timestamp
);
 
insert into movements values
('tom', 'work', 'car', '2022-03-02 14:30'),
('tom', 'home', 'car', '2022-03-02 20:30'),
('tom', 'pub', 'bus', '2022-03-02 22:30'),
('tom', 'home', 'foot', '2022-03-03 02:30'),
('jane', 'school', 'bus', '2022-03-02 08:30'),
('jane', 'home', 'bus', '2022-03-02 14:30'),
('jane', 'pub', 'foot', '2022-03-02 21:30'),
('jane', 'home', 'bus', '2022-03-02 23:30'),
('lila', 'work', 'bus', '2022-03-02 08:30'),
('lila', 'home', 'bus', '2022-03-02 16:30'),
('jake', 'friend', 'car', '2022-03-02 15:30'),
('jake', 'home', 'bus', '2022-03-02 20:30'),
('jake', 'pub', 'car', '2022-03-02 20:30'),
('jake', 'home', 'car', '2022-03-03 02:30');

And the SQL as:

select uname, pmove 
from (
  select uname, location,
    lag (movement_method) over (partition by uname order by ts) as pmove
  from movements) as subq
where location = 'pub';

Now many of the timestamps for Jake are all the same so there is some uncertainty there.

I'd stay away from cross joins / loop joins since you are in Redshift and this implies very large datasets and these processes can explode with such large data.

Upvotes: 1

blabla_bingo
blabla_bingo

Reputation: 2152

Well, I'm not sure if it's a typo but the user jake has an IDENTICAL timestamp at home as at pub's which is an unlikely event. The code may seem a bit complicated, but it does take the problem into consideration.

select t1.`user`, movement_method from movement t1 join
    (select m.`user`, max(m.`timestamp`) mx from movement m 
    join
        (select `user`,`timestamp` from movement where location ='pub') t
        on m.`user` = t.`user` 
        where  m.`timestamp` <=t.`timestamp` and m.`location`!='pub'
        group by `user`) t2
on t1.`user`=t2.`user` and t1.`timestamp`=mx and t1.location!='pub';

Upvotes: 0

Related Questions