MI MA
MI MA

Reputation: 181

SQL query: Difference between EXTRACT(DATE FROM timestamp) vs. WHERE timestamp

I'm querying for the number of trips throughout the hours of the day with the avg mph, within a given date range.

What exactly is the difference between these two functions for working with timestamp data and can someone please explain why the first value in the num_trips column is different? See below

   (
   SELECT EXTRACT(HOUR FROM trip_start_timestamp) AS hour_of_day, 
   trip_seconds, trip_miles
   FROM `bigquery-public-data.chicago_taxi_trips.taxi_trips`
   WHERE EXTRACT(DATE FROM trip_start_timestamp) >= '2017-01-01'
   AND EXTRACT(DATE FROM trip_start_timestamp) < '2017-07-01'
           AND trip_seconds > 0 
           AND trip_miles > 0
       )
       SELECT hour_of_day, 
       COUNT(1) as num_trips,
       (3600 * SUM(trip_miles) / SUM(trip_seconds)) as avg_mph
               FROM RelevantRides1
               GROUP BY hour_of_day
               ORDER BY hour_of_day 

   (
   SELECT EXTRACT(HOUR FROM trip_start_timestamp) AS hour_of_day, 
   trip_miles, trip_seconds
   FROM `bigquery-public-data.chicago_taxi_trips.taxi_trips`
                   WHERE trip_start_timestamp > '2017-01-01' AND 
                         trip_start_timestamp < '2017-07-01' AND 
                         trip_seconds > 0 AND 
                         trip_miles > 0
         )
         SELECT hour_of_day, 
         COUNT(1) AS num_trips, 
         3600 * SUM(trip_miles) / SUM(trip_seconds) AS avg_mph
             FROM RelevantRides
             GROUP BY hour_of_day
             ORDER BY hour_of_day

I expected both queries to return the same result but when I print the dataframe, the first query gives a different result for num_trips for the first hour of the day, where I used Extract, the rest of the results are identical.

Upvotes: 0

Views: 1059

Answers (2)

Spasnof
Spasnof

Reputation: 140

So I cannot make the original example run because I get a Syntax error: Unexpected keyword SELECT at [10:8] if I attempt to copy paste the query in a new project using the bigquery-public-data.chicago_taxi_trips table. That said I will attempt an answer using the first select statements given...

User Mikhail Berlyant gave part of the solution which is that you are mixing >= and > so you have two different queries and are comparing apples with oranges.

If you run the following queries you will notice the same record count being returned.

SELECT count(*) from
 (
  SELECT 
    EXTRACT(HOUR FROM trip_start_timestamp) AS hour_of_day, 
    trip_seconds, trip_miles
  FROM `bigquery-public-data.chicago_taxi_trips.taxi_trips`
  WHERE EXTRACT(DATE FROM trip_start_timestamp) >= '2017-01-01'
    AND EXTRACT(DATE FROM trip_start_timestamp) < '2017-07-01'
    AND trip_seconds > 0 
    AND trip_miles > 0
 ) t;
-- returns 11460748
SELECT count(*) from 
(
   SELECT EXTRACT(HOUR FROM trip_start_timestamp) AS hour_of_day, 
   trip_miles, trip_seconds
   FROM `bigquery-public-data.chicago_taxi_trips.taxi_trips`
  WHERE trip_start_timestamp >= '2017-01-01' 
    AND trip_start_timestamp < '2017-07-01'  
    AND trip_seconds > 0 AND 
    trip_miles > 0
) t1
-- returns 11460748

However if you do > on both ends you don't get the same results. (11409890 vs 11460748 then Why is that? Well that is because when you use EXTRACT(DATE FROM trip_start_timestamp) you floor the field prior to comparison. So for example:

select '2017-01-01 00:15:00' > '2017-01-01'; --true
select extract(date from timestamp '2017-01-01 00:15:00'); --2017-01-01
select extract(date from timestamp '2017-01-01 00:15:00') > '2017-01-01'; --false
  1. In the first statement we say that 2017-01-01 00:15:00 is greater than 2017-01-01 00:00:00 (the extra hour/minute/second of precision is implied)
  2. In the second statement you see what your value floors to (2017-01-01 or 2017-01-01 00:00:00 with implied precision)
  3. When we compare the floored value the given value and factor in the implied precision 2017-01-01 00:00:00 > 2017-01-01 00:15:00 you return false so you exclude any records on the same day. 2017-01-01 06:15:00, 2017-01-01 22:15:00etc

I would recommend using the Minimal, reproducible, viable example docs available here.

Upvotes: 0

Mikhail Berlyant
Mikhail Berlyant

Reputation: 172993

I think the issue is in trip_start_timestamp > '2017-01-01'
To fix discrepancy you should use trip_start_timestamp >= '2017-01-01'

Below simplified example reproduces the issue

#standardSQL
WITH `project.dataset.table` AS (
  SELECT CURRENT_TIMESTAMP() trip_start_timestamp UNION ALL
  SELECT TIMESTAMP_TRUNC(TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY), DAY) UNION ALL
  SELECT TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 2 DAY)
)
SELECT trip_start_timestamp
FROM `project.dataset.table`
WHERE trip_start_timestamp > '2019-08-12'

returns

Row trip_start_timestamp     
1   2019-08-13 05:04:34.747114 UTC   

while

SELECT trip_start_timestamp
FROM `project.dataset.table`
WHERE trip_start_timestamp >= '2019-08-12'   

returns

Row trip_start_timestamp     
1   2019-08-13 05:05:38.784956 UTC   
2   2019-08-12 00:00:00 UTC     

Upvotes: 1

Related Questions