chriswhong
chriswhong

Reputation: 517

Self-Join with Subquery in Google BigQuery

I am a SQL noob in need of some help with a specific query using the NYC 2013 Taxi Trips Dataset located here.

I want to analyze dropoffs at JFK Airport, but want to build my query so that I can include the next subsequent pickup that a taxi does after dropping off someone at the airport.

This query gets me all the trips at the airport for a given day:

SELECT * FROM [833682135931:nyctaxi.trip_data] 
WHERE DATE(pickup_datetime) = '2013-05-01'
  AND FLOAT(pickup_latitude) < 40.651381
  AND FLOAT(pickup_latitude) > 40.640668
  AND FLOAT(pickup_longitude) < -73.776283
  AND FLOAT(pickup_longitude) > -73.794694

I want to join the dataset with itself to add next_pickup_time, next_pickup_lat, and next_pickup_lon values for each row.

To do this, I assume I need a correlated subquery, but don't know where to start building it out because the subquery is based on the outer query.

It needs to search for trips with the same medallion, on the same day, and with a pickup time later than the current airport dropoff, then limit 1... Any help is much appreciated!

Upvotes: 2

Views: 2659

Answers (4)

Pentium10
Pentium10

Reputation: 208042

This should give you all the dropoffs with next pickups

SELECT *
FROM
  (SELECT medallion,
          dropoff_datetime,
          dropoff_longitude,
          dropoff_latitude,
          LEAD(pickup_datetime, 1, "") OVER (PARTITION BY medallion
                                             ORDER BY pickup_datetime) AS next_datetime,
          LEAD(pickup_longitude, 1, "0.0") OVER (PARTITION BY medallion
                                                 ORDER BY pickup_datetime) AS next_longitude,
          LEAD(pickup_latitude, 1, "0.0") OVER (PARTITION BY medallion
                                                ORDER BY pickup_datetime) AS next_latitude
   FROM [833682135931:nyctaxi.trip_data]) d
WHERE date(next_datetime)=date(dropoff_datetime)
  AND DATE(dropoff_datetime) = '2013-05-01'
  AND FLOAT(dropoff_latitude) < 40.651381
  AND FLOAT(dropoff_latitude) > 40.640668
  AND FLOAT(dropoff_longitude) < -73.776283
  AND FLOAT(dropoff_longitude) > -73.794694

Upvotes: 1

chriswhong
chriswhong

Reputation: 517

This is what finally worked, modified from Pentium10's answer:

SELECT *
FROM
  (SELECT medallion,
          dropoff_datetime,
          dropoff_longitude,
          dropoff_latitude,
          LEAD(pickup_datetime, 1, "") OVER (PARTITION BY medallion
                                             ORDER BY pickup_datetime) AS next_datetime,
          LEAD(pickup_longitude, 1, "0.0") OVER (PARTITION BY medallion
                                                 ORDER BY pickup_datetime) AS next_longitude,
          LEAD(pickup_latitude, 1, "0.0") OVER (PARTITION BY medallion
                                                ORDER BY pickup_datetime) AS next_latitude
   FROM [833682135931:nyctaxi.trip_data]) d
WHERE date(next_datetime)=date(dropoff_datetime)
  AND DATE(dropoff_datetime) = '2013-05-01'
  AND FLOAT(dropoff_latitude) < 40.651381
  AND FLOAT(dropoff_latitude) > 40.640668
  AND FLOAT(dropoff_longitude) < -73.776283
  AND FLOAT(dropoff_longitude) > -73.794694

Upvotes: 0

sprocket
sprocket

Reputation: 1277

I think that N.N. has the right idea, except that you want LEAD instead of LAG to get the next pickup. For example, this query will produce the next pickup time, lat and long after a pickup at JFK.

SELECT
    medallion,
    pickup_datetime,
    pickup_longitude,
    pickup_latitude,
    LEAD(pickup_datetime, 1, "") OVER (PARTITION BY medallion ORDER BY pickup_datetime) AS next_datetime,
    LEAD(pickup_longitude, 1, "0.0") OVER (PARTITION BY medallion ORDER BY pickup_datetime) AS next_longitude,
    LEAD(pickup_latitude, 1, "0.0") OVER (PARTITION BY medallion ORDER BY pickup_datetime) AS next_latitude
FROM [833682135931:nyctaxi.trip_data]
WHERE DATE(pickup_datetime) = '2013-05-01'
  AND FLOAT(pickup_latitude) < 40.651381
  AND FLOAT(pickup_latitude) > 40.640668
  AND FLOAT(pickup_longitude) < -73.776283
  AND FLOAT(pickup_longitude) > -73.794694;

Any time you can avoid a self-join, it's good to do so.

Upvotes: 1

N.N.
N.N.

Reputation: 3172

consider using the LAG window function instead of self join

Upvotes: 1

Related Questions