Praxiteles
Praxiteles

Reputation: 6030

Why is BigQuery converting some dates to timestamps but not others?

Our SQL query shown below is converting strings to timestamp fields but failing on some dates and not on others. What is causing this conversion to fail?

SELECT birthdate, TIMESTAMP(REGEXP_REPLACE(birthdate, r'(..)/(..)/(....)', r'\3-\2-\1')) ts
FROM [our_project:our_table] LIMIT 1000

Here are the results. Notice that BigQuery is giving "null" for many of the dates. Why is the regex failing? Is there something to add to make it more robust?

sql results

Here is a second conversion query we tried.

SELECT birthdate, TIMESTAMP(year + '-' + month + '-' + day) as output_timestamp
FROM (
  SELECT 
    birthdate, 
    REGEXP_EXTRACT(birthdate, '.*/([0-9]{4})$') as year, 
    REGEXP_EXTRACT(birthdate, '^([0-9]{2}).*') as day, 
    REGEXP_EXTRACT(birthdate, '.*/([0-9]{2})/.*') AS month 
  FROM 
    [our_project:our_table]
)
LIMIT 1000

Notice that nulls appeared in these results as well.

sql_query2

How might we fix what is going wrong?

Upvotes: 0

Views: 1261

Answers (3)

Mosha Pasumansky
Mosha Pasumansky

Reputation: 14014

If your data has custom formatting of timestamps, you can always use PARSE_TIMESTAMP function in Standard (non-legacy) SQL - https://cloud.google.com/bigquery/sql-reference/functions-and-operators#parse_timestamp I.e. all the following queries

select parse_timestamp("%Y-%d-%m", x) from
unnest(["2016-31-12", "1999-01-02"]) x

select parse_timestamp("%Y-%d-%m", x) from
unnest(["2016-31-12", "1999-01-02"]) x

select parse_timestamp("%Y-%b-%d", x) from
unnest(["2016-Dec-31", "1999-Feb-01"]) x

results in

f0_  
1   2016-12-31 00:00:00 UTC  
2   1999-02-01 00:00:00 UTC 

Upvotes: 1

Thang
Thang

Reputation: 137

Is there a reason you're not using the supported TIMESTAMP data type?

From the docs:

You can describe TIMESTAMP data types as either UNIX timestamps or calendar datetimes.

Datetimes need to be in a specific format:

A date and time string in the format YYYY-MM-DD HH:MM:SS. The UTC and Z specifiers are supported.

This would also make it easier to query this particular column, as it would allow you to leverage BigQuery's standard SQL dialect. Commands such as HOUR, DAYOFWEEK, DAYOFYEAR, etc.

Here's an example query using one of BQ's public datasets to find the most popular pickup hour using a timestamp field:

SELECT
  HOUR(pickup_datetime) as pickup_hour, 
  COUNT(*) as pickup_count
FROM
  [nyc-tlc:green.trips_2014]
GROUP BY
  1
ORDER BY 
  pickup_count DESC

will yield:

Row pickup_hour pickup_count     
1   19  1059068  
2   18  1051326  
3   20  985664   
4   17  957583   
5   21  938378   
6   22  908296   

Upvotes: 3

Praxiteles
Praxiteles

Reputation: 6030

It turns out that the month and the day were swapped (international versus U.S.) The result is that the ranges were invalid for the timestamp. Once we swapped the day and the month - then the conversions occurred without problems.

Upvotes: 2

Related Questions