Reputation: 6030
Our SQL query shown below is converting strings to timestamp fields but failing on some dates and not on others. What is causing this conversion to fail?
SELECT birthdate, TIMESTAMP(REGEXP_REPLACE(birthdate, r'(..)/(..)/(....)', r'\3-\2-\1')) ts
FROM [our_project:our_table] LIMIT 1000
Here are the results. Notice that BigQuery is giving "null" for many of the dates. Why is the regex failing? Is there something to add to make it more robust?
Here is a second conversion query we tried.
SELECT birthdate, TIMESTAMP(year + '-' + month + '-' + day) as output_timestamp
FROM (
SELECT
birthdate,
REGEXP_EXTRACT(birthdate, '.*/([0-9]{4})$') as year,
REGEXP_EXTRACT(birthdate, '^([0-9]{2}).*') as day,
REGEXP_EXTRACT(birthdate, '.*/([0-9]{2})/.*') AS month
FROM
[our_project:our_table]
)
LIMIT 1000
Notice that nulls appeared in these results as well.
How might we fix what is going wrong?
Upvotes: 0
Views: 1261
Reputation: 14014
If your data has custom formatting of timestamps, you can always use PARSE_TIMESTAMP function in Standard (non-legacy) SQL - https://cloud.google.com/bigquery/sql-reference/functions-and-operators#parse_timestamp I.e. all the following queries
select parse_timestamp("%Y-%d-%m", x) from
unnest(["2016-31-12", "1999-01-02"]) x
select parse_timestamp("%Y-%d-%m", x) from
unnest(["2016-31-12", "1999-01-02"]) x
select parse_timestamp("%Y-%b-%d", x) from
unnest(["2016-Dec-31", "1999-Feb-01"]) x
results in
f0_
1 2016-12-31 00:00:00 UTC
2 1999-02-01 00:00:00 UTC
Upvotes: 1
Reputation: 137
Is there a reason you're not using the supported TIMESTAMP data type?
From the docs:
You can describe TIMESTAMP data types as either UNIX timestamps or calendar datetimes.
Datetimes need to be in a specific format:
A date and time string in the format YYYY-MM-DD HH:MM:SS. The UTC and Z specifiers are supported.
This would also make it easier to query this particular column, as it would allow you to leverage BigQuery's standard SQL dialect. Commands such as HOUR, DAYOFWEEK, DAYOFYEAR, etc.
Here's an example query using one of BQ's public datasets to find the most popular pickup hour using a timestamp field:
SELECT
HOUR(pickup_datetime) as pickup_hour,
COUNT(*) as pickup_count
FROM
[nyc-tlc:green.trips_2014]
GROUP BY
1
ORDER BY
pickup_count DESC
will yield:
Row pickup_hour pickup_count
1 19 1059068
2 18 1051326
3 20 985664
4 17 957583
5 21 938378
6 22 908296
Upvotes: 3
Reputation: 6030
It turns out that the month and the day were swapped (international versus U.S.) The result is that the ranges were invalid for the timestamp. Once we swapped the day and the month - then the conversions occurred without problems.
Upvotes: 2