Reputation: 31
Pretty simple schema of a USER_ID and a bunch of dates that correspond to "stages" of the contract cycle. Problem is that I need the most recent date and/or the data element with the most recent date.
Given a schema of:
USER STRING
date1 Timestamp
date2 Timestamp
date3 Timestamp
date4 Timestamp
date5 Timestamp
USER date1 date2 date3 date4 date5
Fred 01/01/1980 02/02/1980 03/15/1980 04/15/1980 05/15/1980
Jane 02/01/1980 03/01/1980 02/15/1980 02/01/1980 02/01/1980
The following code is clunky and assumes that stages are all done in order which isn't always the case.
CASE WHEN date5 is NOT NULL THEN DATE(date5)
WHEN date4 is NOT NULL THEN DATE(date4)
WHEN date3 is NOT NULL THEN DATE(date3)
WHEN date2 is NOT NULL THEN DATE(date2)
ELSE date1 END AS Last_Date,
What I'd love to be able to do is a simple MAX(date1, date2, date3, date4, date5) which would give me the most recent date out of the list.
SELECT USERID, MAX(date1, date2, date3, date4, date5)
In the above example is would return:
USERID | Last_Date
Fred | 05/15/1980
Jane | 03/01/1980
Also very useful is to retrieve the actual data element that is the most recent returning "date2" if it is the most recent.
So again:
SELECT USERID, MAX_ELEMENT(date1,date2,date3,date4,date5) AS LATEST_ELEMENT
would retrieve:
USERID | LATEST_ELEMENT
Fred | date5
Jane | date2
Seems like there should be a better / more elegant or performant method.
Mikhail Berlyant has a great solution but I've discovered that there are a lot of stages with the same DATE but close TIMESTAMPS as follows:
Chooses the first DATE so if Stage1 and Stage2 were two hours apart on the same day then Stage1 is considered the last which is incorrect here.
Id date1 date2 last_col last_date
FRED 2020-09-14 2020-09-14 date1 2020-09-14
JANE 2019-04-13 2019-04-13 date1 2019-04-13
Using timestamps also gets some incorrect results due to the time not being accounted for:
Id date1 date2 last_col last_date
FRED 2020-08-13 23:35:38 UTC 2020-08-13 23:49:06 UTC date1 2020-08-13T23
JANE 2021-01-07 02:16:41 UTC null date2 null
AMY 2021-01-07 02:35:08 UTC 2021-01-07 23:26:20 UTC date2 2021-01-07T23
Apparently hours are handled but not minutes or seconds. Also nulls apparently evaluate ahead of any other but should actually be discarded as they are no action taken fields.
Upvotes: 0
Views: 103
Reputation: 31
@mikhail-berlyant Provided an answer given my understanding of the data at the time. A couple of modifications.
Complete code as I am using it:
#standardSQL
SELECT
t.Id,
date1,
date2,
date3,
ARRAY_AGG( STRUCT(SPLIT(col_value, ':')[
OFFSET (0)] AS last_col,SPLIT(col_value, ':')[OFFSET(1)] AS last_date)
ORDER BY SPLIT(col_value, ':')[OFFSET (1)] DESC LIMIT 1 )[OFFSET (0)].*
FROM
`DATATABLE` t,
UNNEST(SPLIT(TRANSLATE(to_json_STRING((
SELECT
AS STRUCT
coalesce(UNIX_SECONDS(t.date1),0) as date1,
coalesce(UNIX_SECONDS(t.date2),0) as date2,
coalesce(UNIX_SECONDS(t.date3),0) as date3)), '{}"', ''))) col_value
GROUP BY
Id, date1, date2, date3
Upvotes: 0
Reputation: 172993
Below is for BigQuery Standard SQL
#standardSQL
select user,
array_agg(
struct(split(col_value, ':')[offset(0)] as last_col, split(col_value, ':')[offset(1)] as last_date)
order by parse_date('%m/%d/%Y', split(col_value, ':')[offset(1)]) desc
limit 1
)[offset(0)].*
from `project.dataset.yourtable` t,
unnest(split(translate(to_json_string((select as struct t.* except(user))), '{}"', ''))) col_value
group by user
If to apply to sample data from your question
with `project.dataset.yourtable` as (
select 'Fred' user, '01/01/1980' date1, '02/02/1980' date2, '03/15/1980' date3, '04/15/1980' date4, '05/15/1980' date5 union all
select 'Jane', '02/01/1980', '03/01/1980', '02/15/1980', '02/01/1980', '02/01/1980'
)
output is
In case if your dateN columns are of timestamp data type - use below version
select user,
array_agg(
struct(split(col_value, ':')[offset(0)] as last_col, regexp_extract(col_value, ':(.*)') as last_date)
order by regexp_extract(col_value, ':(.*)') desc
limit 1
)[offset(0)].*
from `project.dataset.yourtable` t,
unnest(split(translate(to_json_string((select as struct t.* except(user))), '{}"', ''))) col_value
group by user
Upvotes: 1