berlin
berlin

Reputation: 31

Retrieving most recent date from series of dates in a single record / row

Pretty simple schema of a USER_ID and a bunch of dates that correspond to "stages" of the contract cycle. Problem is that I need the most recent date and/or the data element with the most recent date.

Given a schema of:

USER  STRING
date1 Timestamp
date2 Timestamp
date3 Timestamp
date4 Timestamp
date5 Timestamp


USER   date1       date2       date3       date4       date5 
Fred   01/01/1980  02/02/1980  03/15/1980  04/15/1980  05/15/1980
Jane   02/01/1980  03/01/1980  02/15/1980  02/01/1980  02/01/1980

The following code is clunky and assumes that stages are all done in order which isn't always the case.

CASE WHEN date5 is NOT NULL THEN DATE(date5)
     WHEN date4 is NOT NULL THEN DATE(date4)
     WHEN date3 is NOT NULL THEN DATE(date3)
     WHEN date2 is NOT NULL THEN DATE(date2)
ELSE date1 END AS Last_Date,

First Ask: Get most recent date from list

What I'd love to be able to do is a simple MAX(date1, date2, date3, date4, date5) which would give me the most recent date out of the list.

SELECT USERID, MAX(date1, date2, date3, date4, date5)

In the above example is would return:

USERID | Last_Date
Fred   | 05/15/1980
Jane   | 03/01/1980

Second Request: Get most recent data element

Also very useful is to retrieve the actual data element that is the most recent returning "date2" if it is the most recent.

So again:

SELECT USERID, MAX_ELEMENT(date1,date2,date3,date4,date5) AS LATEST_ELEMENT

would retrieve:

USERID | LATEST_ELEMENT
Fred   | date5
Jane   | date2

Seems like there should be a better / more elegant or performant method.

Final Tuning to Account for Nulls and Timestamps (as opposed to dates)

Mikhail Berlyant has a great solution but I've discovered that there are a lot of stages with the same DATE but close TIMESTAMPS as follows:

DATE FORMATS

Chooses the first DATE so if Stage1 and Stage2 were two hours apart on the same day then Stage1 is considered the last which is incorrect here.

Id     date1       date2        last_col  last_date 
FRED   2020-09-14  2020-09-14   date1     2020-09-14
JANE   2019-04-13  2019-04-13   date1     2019-04-13

Using timestamps also gets some incorrect results due to the time not being accounted for:

Id   date1                   date2                    last_col  last_date   
FRED 2020-08-13 23:35:38 UTC 2020-08-13 23:49:06 UTC  date1     2020-08-13T23
JANE 2021-01-07 02:16:41 UTC null                     date2     null
AMY  2021-01-07 02:35:08 UTC 2021-01-07 23:26:20 UTC  date2     2021-01-07T23

Apparently hours are handled but not minutes or seconds. Also nulls apparently evaluate ahead of any other but should actually be discarded as they are no action taken fields.

Upvotes: 0

Views: 103

Answers (2)

berlin
berlin

Reputation: 31

@mikhail-berlyant Provided an answer given my understanding of the data at the time. A couple of modifications.

  1. Translate a Timestamp to UNIX_SECONDS to avoid the colons in the timestamp breaking the json which means it only stores to date-hour rather than date-hour-min-seconds.
  2. Use COALESCE to force nulls to zero which keeps them from becoming the solution.

Complete code as I am using it:

#standardSQL
SELECT
  t.Id,
  date1,
  date2,
  date3,
  ARRAY_AGG( STRUCT(SPLIT(col_value, ':')[
    OFFSET (0)] AS last_col,SPLIT(col_value, ':')[OFFSET(1)] AS last_date)
  ORDER BY SPLIT(col_value, ':')[OFFSET (1)] DESC   LIMIT    1 )[OFFSET  (0)].*
FROM
  `DATATABLE` t,
  UNNEST(SPLIT(TRANSLATE(to_json_STRING((
          SELECT
            AS STRUCT 
            coalesce(UNIX_SECONDS(t.date1),0) as date1,
            coalesce(UNIX_SECONDS(t.date2),0) as date2,
            coalesce(UNIX_SECONDS(t.date3),0) as date3)), '{}"', ''))) col_value
GROUP BY
  Id,  date1,  date2,  date3

Upvotes: 0

Mikhail Berlyant
Mikhail Berlyant

Reputation: 172993

Below is for BigQuery Standard SQL

#standardSQL
select user, 
  array_agg(
    struct(split(col_value, ':')[offset(0)] as last_col, split(col_value, ':')[offset(1)] as last_date) 
    order by parse_date('%m/%d/%Y', split(col_value, ':')[offset(1)]) desc 
    limit 1
  )[offset(0)].* 
from `project.dataset.yourtable` t,
unnest(split(translate(to_json_string((select as struct t.* except(user))), '{}"', ''))) col_value
group by user    

If to apply to sample data from your question

with `project.dataset.yourtable` as (
  select 'Fred' user, '01/01/1980' date1, '02/02/1980' date2, '03/15/1980' date3, '04/15/1980' date4, '05/15/1980' date5 union all
  select 'Jane', '02/01/1980', '03/01/1980', '02/15/1980', '02/01/1980', '02/01/1980'
) 

output is

enter image description here

In case if your dateN columns are of timestamp data type - use below version

select user, 
  array_agg(
    struct(split(col_value, ':')[offset(0)] as last_col, regexp_extract(col_value, ':(.*)') as last_date) 
    order by regexp_extract(col_value, ':(.*)') desc 
    limit 1
  )[offset(0)].* 
from `project.dataset.yourtable` t,
unnest(split(translate(to_json_string((select as struct t.* except(user))), '{}"', ''))) col_value
group by user    

Upvotes: 1

Related Questions