yf879
yf879

Reputation: 168

Google Big query different result based on same date filter

Edit 1: so the issue is '<=' is acting as '<' in google query which is strange. But '>=' acts normally. Any idea why this is happening?

Goal: to get data for May 2019.

Info about database here: https://packaging.python.org/en/latest/guides/analyzing-pypi-package-downloads/

Query 1 uses timestamp > '2019-04-30' AND timestamp < '2019-06-01'

SELECT file.project AS package, COUNT(file.project) AS installs, FORMAT_DATETIME('%Y-%m', timestamp) AS month
FROM `bigquery-public-data.pypi.file_downloads` 
WHERE timestamp > '2019-04-30' AND timestamp < '2019-06-01'
GROUP BY month, package;

enter image description here

Query 2 uses timestamp >= '2019-05-01' AND timestamp <= '2019-05-31'

SELECT file.project AS package, COUNT(file.project) AS installs, FORMAT_DATETIME('%Y-%m', timestamp) AS month
FROM `bigquery-public-data.pypi.file_downloads` 
WHERE timestamp >= '2019-05-01' AND timestamp <= '2019-05-31'
GROUP BY month, package;

enter image description here

Both query one and two should scan same amount of data - May 2019 but both query gives different results and scans different amount of data as you can see in attached images.

Which one is correct and why both are not matching?

Upvotes: 1

Views: 909

Answers (2)

Jaytiger
Jaytiger

Reputation: 12274

You're comparing timestamp with a date literal. When a date literal is implicitly cast as timestamp, it will have '00:00:00' time.

Query 1 uses timestamp > '2019-04-30' AND timestamp < '2019-06-01'

This is same as

timestamp > '2019-04-30 00:00:00 UTC' AND timestamp < '2019-06-01 00:00:00 UTC'

which includes data between 2019-04-30 00:00:01 UTC and 2019-04-30 23:59:59 UTC.

Query 2 uses timestamp >= '2019-05-01' AND timestamp <= '2019-05-31'

same as

timestamp >= '2019-05-01 00:00:00 UTC' AND timestamp <= '2019-05-31 00:00:00 UTC'

in this case, you're missing data between 2019-05-31 00:00:01 UTC and 2019-05-31 23:59:59 UTC which is incorrect.

Correct Condition

You might want to use:

timestamp >= '2019-05-01' AND timestamp < '2019-06-01'

Note that since BEWEEN condition is inclusive, following conditions will not be what you want also.

WHERE timestamp BETWEEN  '2019-05-01' AND '2019-05-31' --> this will ignore data on last day of May except '2019-05-31 00:00:00 UTC'.
or
WHERE timestamp BETWEEN  '2019-05-01' AND '2019-06-01' --> this will include '2019-06-01 00:00:00 UTC' data like below screenshot.

enter image description here

SELECT EXTRACT(MONTH FROM timestamp) month, COUNT(1) cnt
  FROM `bigquery-public-data.pypi.file_downloads`
 WHERE timestamp BETWEEN  '2019-05-01' AND '2019-06-01' -- scan 22.57 GB
 GROUP BY 1

(update)

SELECT EXTRACT(DAY FROM timestamp) day, COUNT(1) cnt
  FROM `bigquery-public-data.pypi.file_downloads`
 WHERE timestamp BETWEEN '2019-05-29' AND '2019-05-31'
 GROUP BY 1
;

output:
+-----+-----+-----------+
| Row | day |    cnt    |
+-----+-----+-----------+
|   1 |  30 | 116744449 |
|   2 |  29 | 120865824 |
|   3 |  31 |      1027 |  -- should be 112116613
+-----+-----+-----------+

Upvotes: 1

Jiho Choi
Jiho Choi

Reputation: 1321

The two filters are different, you can simply check the difference in the result by the below script.

Differences
SELECT timestamp, FORMAT_DATETIME('%Y-%m', timestamp) AS month
FROM `bigquery-public-data.pypi.file_downloads` 
WHERE
    timestamp > '2019-04-30' AND timestamp < '2019-06-01'
    AND NOT (timestamp >= '2019-05-01' AND timestamp <= '2019-05-31')
;
Results

enter image description here

Personal Preference
SELECT file.project AS package, COUNT(file.project) AS installs, FORMAT_DATETIME('%Y-%m', timestamp) AS month
FROM `bigquery-public-data.pypi.file_downloads`
WHERE timestamp BETWEEN '2019-05-01' AND '2019-05-31'

p.s. As you can check out in the doc, the ordering of the standard SQL is below. The filter of WHERE happens before the SELECT, thus you might want to store the result of the SELECT statement and do the filtering to filter by date, not datetime.

FROM -> WHERE -> GROUP BY -> HAVING -> ...

Upvotes: 0

Related Questions