Reputation: 517
I am a SQL noob in need of some help with a specific query using the NYC 2013 Taxi Trips Dataset located here.
I want to analyze dropoffs at JFK Airport, but want to build my query so that I can include the next subsequent pickup that a taxi does after dropping off someone at the airport.
This query gets me all the trips at the airport for a given day:
SELECT * FROM [833682135931:nyctaxi.trip_data]
WHERE DATE(pickup_datetime) = '2013-05-01'
AND FLOAT(pickup_latitude) < 40.651381
AND FLOAT(pickup_latitude) > 40.640668
AND FLOAT(pickup_longitude) < -73.776283
AND FLOAT(pickup_longitude) > -73.794694
I want to join the dataset with itself to add next_pickup_time, next_pickup_lat, and next_pickup_lon values for each row.
To do this, I assume I need a correlated subquery, but don't know where to start building it out because the subquery is based on the outer query.
It needs to search for trips with the same medallion, on the same day, and with a pickup time later than the current airport dropoff, then limit 1... Any help is much appreciated!
Upvotes: 2
Views: 2659
Reputation: 208042
This should give you all the dropoffs with next pickups
SELECT *
FROM
(SELECT medallion,
dropoff_datetime,
dropoff_longitude,
dropoff_latitude,
LEAD(pickup_datetime, 1, "") OVER (PARTITION BY medallion
ORDER BY pickup_datetime) AS next_datetime,
LEAD(pickup_longitude, 1, "0.0") OVER (PARTITION BY medallion
ORDER BY pickup_datetime) AS next_longitude,
LEAD(pickup_latitude, 1, "0.0") OVER (PARTITION BY medallion
ORDER BY pickup_datetime) AS next_latitude
FROM [833682135931:nyctaxi.trip_data]) d
WHERE date(next_datetime)=date(dropoff_datetime)
AND DATE(dropoff_datetime) = '2013-05-01'
AND FLOAT(dropoff_latitude) < 40.651381
AND FLOAT(dropoff_latitude) > 40.640668
AND FLOAT(dropoff_longitude) < -73.776283
AND FLOAT(dropoff_longitude) > -73.794694
Upvotes: 1
Reputation: 517
This is what finally worked, modified from Pentium10's answer:
SELECT *
FROM
(SELECT medallion,
dropoff_datetime,
dropoff_longitude,
dropoff_latitude,
LEAD(pickup_datetime, 1, "") OVER (PARTITION BY medallion
ORDER BY pickup_datetime) AS next_datetime,
LEAD(pickup_longitude, 1, "0.0") OVER (PARTITION BY medallion
ORDER BY pickup_datetime) AS next_longitude,
LEAD(pickup_latitude, 1, "0.0") OVER (PARTITION BY medallion
ORDER BY pickup_datetime) AS next_latitude
FROM [833682135931:nyctaxi.trip_data]) d
WHERE date(next_datetime)=date(dropoff_datetime)
AND DATE(dropoff_datetime) = '2013-05-01'
AND FLOAT(dropoff_latitude) < 40.651381
AND FLOAT(dropoff_latitude) > 40.640668
AND FLOAT(dropoff_longitude) < -73.776283
AND FLOAT(dropoff_longitude) > -73.794694
Upvotes: 0
Reputation: 1277
I think that N.N. has the right idea, except that you want LEAD instead of LAG to get the next pickup. For example, this query will produce the next pickup time, lat and long after a pickup at JFK.
SELECT
medallion,
pickup_datetime,
pickup_longitude,
pickup_latitude,
LEAD(pickup_datetime, 1, "") OVER (PARTITION BY medallion ORDER BY pickup_datetime) AS next_datetime,
LEAD(pickup_longitude, 1, "0.0") OVER (PARTITION BY medallion ORDER BY pickup_datetime) AS next_longitude,
LEAD(pickup_latitude, 1, "0.0") OVER (PARTITION BY medallion ORDER BY pickup_datetime) AS next_latitude
FROM [833682135931:nyctaxi.trip_data]
WHERE DATE(pickup_datetime) = '2013-05-01'
AND FLOAT(pickup_latitude) < 40.651381
AND FLOAT(pickup_latitude) > 40.640668
AND FLOAT(pickup_longitude) < -73.776283
AND FLOAT(pickup_longitude) > -73.794694;
Any time you can avoid a self-join, it's good to do so.
Upvotes: 1