Reputation: 91
I am trying to find the preceding row to a specific occurrence in a database, or rather some data from it.
In this example I would like to find the movement_method of the preceding row (sorted by timestamp) before a user visited the pub. So in tom's example I would like to know that tom went home by car before visiting the pub. (it doesn't matter how he traveled to the pub but rather the used method before going to the pub)
I have an example database with: user, location, movement_method, timestamp:
user | location | movement_method | timestamp |
---|---|---|---|
tom | work | car | 2022-03-02 14:30 |
tom | home | car | 2022-03-02 20:30 |
tom | pub | bus | 2022-03-02 22:30 |
tom | home | foot | 2022-03-03 02:30 |
jane | school | bus | 2022-03-02 08:30 |
jane | home | bus | 2022-03-02 14:30 |
jane | pub | foot | 2022-03-02 21:30 |
jane | home | bus | 2022-03-02 23:30 |
lila | work | bus | 2022-03-02 08:30 |
lila | home | bus | 2022-03-02 16:30 |
jake | friend | car | 2022-03-02 15:30 |
jake | home | bus | 2022-03-02 20:30 |
jake | pub | car | 2022-03-02 20:30 |
jake | home | car | 2022-03-03 02:30 |
For this database the result I would want would be: | user | preceding_movement_method | | ---- | ------- | | tom | car | | jane | bus | | jake | bus |
My current approach is to have a partition or window function for "preceding_movement_method" but I'm stuck finding the "preceding" entry before the one which fits the where statement.
So I'm looking for something like this pseudocode:
select user,
(select preceding movement_method
from movement_database
where location = 'pub'
order by timestamp) as preceding_movement_method
from movement_database
Upvotes: 0
Views: 514
Reputation: 11032
The LAG() window function is where I would go with this. I set the (sqlfiddle) data up as:
create table movements (
uname varchar(16),
location varchar(16),
movement_method varchar(16),
ts timestamp
);
insert into movements values
('tom', 'work', 'car', '2022-03-02 14:30'),
('tom', 'home', 'car', '2022-03-02 20:30'),
('tom', 'pub', 'bus', '2022-03-02 22:30'),
('tom', 'home', 'foot', '2022-03-03 02:30'),
('jane', 'school', 'bus', '2022-03-02 08:30'),
('jane', 'home', 'bus', '2022-03-02 14:30'),
('jane', 'pub', 'foot', '2022-03-02 21:30'),
('jane', 'home', 'bus', '2022-03-02 23:30'),
('lila', 'work', 'bus', '2022-03-02 08:30'),
('lila', 'home', 'bus', '2022-03-02 16:30'),
('jake', 'friend', 'car', '2022-03-02 15:30'),
('jake', 'home', 'bus', '2022-03-02 20:30'),
('jake', 'pub', 'car', '2022-03-02 20:30'),
('jake', 'home', 'car', '2022-03-03 02:30');
And the SQL as:
select uname, pmove
from (
select uname, location,
lag (movement_method) over (partition by uname order by ts) as pmove
from movements) as subq
where location = 'pub';
Now many of the timestamps for Jake are all the same so there is some uncertainty there.
I'd stay away from cross joins / loop joins since you are in Redshift and this implies very large datasets and these processes can explode with such large data.
Upvotes: 1
Reputation: 2152
Well, I'm not sure if it's a typo but the user jake has an IDENTICAL timestamp at home as at pub's which is an unlikely event. The code may seem a bit complicated, but it does take the problem into consideration.
select t1.`user`, movement_method from movement t1 join
(select m.`user`, max(m.`timestamp`) mx from movement m
join
(select `user`,`timestamp` from movement where location ='pub') t
on m.`user` = t.`user`
where m.`timestamp` <=t.`timestamp` and m.`location`!='pub'
group by `user`) t2
on t1.`user`=t2.`user` and t1.`timestamp`=mx and t1.location!='pub';
Upvotes: 0