Reputation: 181
I'm querying for the number of trips throughout the hours of the day with the avg mph, within a given date range.
What exactly is the difference between these two functions for working with timestamp data and can someone please explain why the first value in the num_trips column is different? See below
(
SELECT EXTRACT(HOUR FROM trip_start_timestamp) AS hour_of_day,
trip_seconds, trip_miles
FROM `bigquery-public-data.chicago_taxi_trips.taxi_trips`
WHERE EXTRACT(DATE FROM trip_start_timestamp) >= '2017-01-01'
AND EXTRACT(DATE FROM trip_start_timestamp) < '2017-07-01'
AND trip_seconds > 0
AND trip_miles > 0
)
SELECT hour_of_day,
COUNT(1) as num_trips,
(3600 * SUM(trip_miles) / SUM(trip_seconds)) as avg_mph
FROM RelevantRides1
GROUP BY hour_of_day
ORDER BY hour_of_day
(
SELECT EXTRACT(HOUR FROM trip_start_timestamp) AS hour_of_day,
trip_miles, trip_seconds
FROM `bigquery-public-data.chicago_taxi_trips.taxi_trips`
WHERE trip_start_timestamp > '2017-01-01' AND
trip_start_timestamp < '2017-07-01' AND
trip_seconds > 0 AND
trip_miles > 0
)
SELECT hour_of_day,
COUNT(1) AS num_trips,
3600 * SUM(trip_miles) / SUM(trip_seconds) AS avg_mph
FROM RelevantRides
GROUP BY hour_of_day
ORDER BY hour_of_day
I expected both queries to return the same result but when I print the dataframe, the first query gives a different result for num_trips for the first hour of the day, where I used Extract
, the rest of the results are identical.
Upvotes: 0
Views: 1059
Reputation: 140
So I cannot make the original example run because I get a Syntax error: Unexpected keyword SELECT at [10:8]
if I attempt to copy paste the query in a new project using the bigquery-public-data.chicago_taxi_trips
table. That said I will attempt an answer using the first select statements given...
User Mikhail Berlyant gave part of the solution which is that you are mixing >=
and >
so you have two different queries and are comparing apples with oranges.
If you run the following queries you will notice the same record count being returned.
SELECT count(*) from
(
SELECT
EXTRACT(HOUR FROM trip_start_timestamp) AS hour_of_day,
trip_seconds, trip_miles
FROM `bigquery-public-data.chicago_taxi_trips.taxi_trips`
WHERE EXTRACT(DATE FROM trip_start_timestamp) >= '2017-01-01'
AND EXTRACT(DATE FROM trip_start_timestamp) < '2017-07-01'
AND trip_seconds > 0
AND trip_miles > 0
) t;
-- returns 11460748
SELECT count(*) from
(
SELECT EXTRACT(HOUR FROM trip_start_timestamp) AS hour_of_day,
trip_miles, trip_seconds
FROM `bigquery-public-data.chicago_taxi_trips.taxi_trips`
WHERE trip_start_timestamp >= '2017-01-01'
AND trip_start_timestamp < '2017-07-01'
AND trip_seconds > 0 AND
trip_miles > 0
) t1
-- returns 11460748
However if you do >
on both ends you don't get the same results. (11409890 vs 11460748 then Why is that? Well that is because when you use EXTRACT(DATE FROM trip_start_timestamp)
you floor the field prior to comparison. So for example:
select '2017-01-01 00:15:00' > '2017-01-01'; --true
select extract(date from timestamp '2017-01-01 00:15:00'); --2017-01-01
select extract(date from timestamp '2017-01-01 00:15:00') > '2017-01-01'; --false
2017-01-01 00:15:00
is greater than 2017-01-01 00:00:00
(the extra hour/minute/second of precision is implied)2017-01-01
or 2017-01-01 00:00:00
with implied precision)2017-01-01 00:00:00
> 2017-01-01 00:15:00
you return false so you exclude any records on the same day. 2017-01-01 06:15:00
, 2017-01-01 22:15:00
etcI would recommend using the Minimal, reproducible, viable example docs available here.
Upvotes: 0
Reputation: 172993
I think the issue is in trip_start_timestamp > '2017-01-01'
To fix discrepancy you should use trip_start_timestamp >= '2017-01-01'
Below simplified example reproduces the issue
#standardSQL
WITH `project.dataset.table` AS (
SELECT CURRENT_TIMESTAMP() trip_start_timestamp UNION ALL
SELECT TIMESTAMP_TRUNC(TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY), DAY) UNION ALL
SELECT TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 2 DAY)
)
SELECT trip_start_timestamp
FROM `project.dataset.table`
WHERE trip_start_timestamp > '2019-08-12'
returns
Row trip_start_timestamp
1 2019-08-13 05:04:34.747114 UTC
while
SELECT trip_start_timestamp
FROM `project.dataset.table`
WHERE trip_start_timestamp >= '2019-08-12'
returns
Row trip_start_timestamp
1 2019-08-13 05:05:38.784956 UTC
2 2019-08-12 00:00:00 UTC
Upvotes: 1