Reputation: 168
Edit 1: so the issue is '<=' is acting as '<' in google query which is strange. But '>=' acts normally. Any idea why this is happening?
Goal: to get data for May 2019.
Info about database here: https://packaging.python.org/en/latest/guides/analyzing-pypi-package-downloads/
Query 1 uses timestamp > '2019-04-30' AND timestamp < '2019-06-01'
SELECT file.project AS package, COUNT(file.project) AS installs, FORMAT_DATETIME('%Y-%m', timestamp) AS month
FROM `bigquery-public-data.pypi.file_downloads`
WHERE timestamp > '2019-04-30' AND timestamp < '2019-06-01'
GROUP BY month, package;
Query 2 uses timestamp >= '2019-05-01' AND timestamp <= '2019-05-31'
SELECT file.project AS package, COUNT(file.project) AS installs, FORMAT_DATETIME('%Y-%m', timestamp) AS month
FROM `bigquery-public-data.pypi.file_downloads`
WHERE timestamp >= '2019-05-01' AND timestamp <= '2019-05-31'
GROUP BY month, package;
Both query one and two should scan same amount of data - May 2019 but both query gives different results and scans different amount of data as you can see in attached images.
Which one is correct and why both are not matching?
Upvotes: 1
Views: 909
Reputation: 12274
You're comparing timestamp
with a date
literal. When a date
literal is implicitly cast as timestamp
, it will have '00:00:00' time.
This is same as
timestamp > '2019-04-30 00:00:00 UTC' AND timestamp < '2019-06-01 00:00:00 UTC'
which includes data between 2019-04-30 00:00:01 UTC
and 2019-04-30 23:59:59 UTC
.
same as
timestamp >= '2019-05-01 00:00:00 UTC' AND timestamp <= '2019-05-31 00:00:00 UTC'
in this case, you're missing data between 2019-05-31 00:00:01 UTC
and 2019-05-31 23:59:59 UTC
which is incorrect.
You might want to use:
timestamp >= '2019-05-01' AND timestamp < '2019-06-01'
Note that since BEWEEN
condition is inclusive, following conditions will not be what you want also.
WHERE timestamp BETWEEN '2019-05-01' AND '2019-05-31' --> this will ignore data on last day of May except '2019-05-31 00:00:00 UTC'.
or
WHERE timestamp BETWEEN '2019-05-01' AND '2019-06-01' --> this will include '2019-06-01 00:00:00 UTC' data like below screenshot.
SELECT EXTRACT(MONTH FROM timestamp) month, COUNT(1) cnt
FROM `bigquery-public-data.pypi.file_downloads`
WHERE timestamp BETWEEN '2019-05-01' AND '2019-06-01' -- scan 22.57 GB
GROUP BY 1
(update)
SELECT EXTRACT(DAY FROM timestamp) day, COUNT(1) cnt
FROM `bigquery-public-data.pypi.file_downloads`
WHERE timestamp BETWEEN '2019-05-29' AND '2019-05-31'
GROUP BY 1
;
output:
+-----+-----+-----------+
| Row | day | cnt |
+-----+-----+-----------+
| 1 | 30 | 116744449 |
| 2 | 29 | 120865824 |
| 3 | 31 | 1027 | -- should be 112116613
+-----+-----+-----------+
Upvotes: 1
Reputation: 1321
The two filters are different, you can simply check the difference in the result by the below script.
SELECT timestamp, FORMAT_DATETIME('%Y-%m', timestamp) AS month
FROM `bigquery-public-data.pypi.file_downloads`
WHERE
timestamp > '2019-04-30' AND timestamp < '2019-06-01'
AND NOT (timestamp >= '2019-05-01' AND timestamp <= '2019-05-31')
;
SELECT file.project AS package, COUNT(file.project) AS installs, FORMAT_DATETIME('%Y-%m', timestamp) AS month
FROM `bigquery-public-data.pypi.file_downloads`
WHERE timestamp BETWEEN '2019-05-01' AND '2019-05-31'
p.s. As you can check out in the doc, the ordering of the standard SQL is below. The filter
of WHERE
happens before the SELECT
, thus you might want to store the result of the SELECT
statement and do the filtering to filter by date, not datetime.
FROM
->WHERE
->GROUP BY
->HAVING
-> ...
Upvotes: 0