Google Big Query SQL - Get most recent unique value by date

Question

#EDIT - Following the comments, I rephrase my question

I have a BigQuery table that i want to use to get some KPI of my application. In this table, I save each create or update as a new line in order to keep a better history. So I have several times the same data with a different state.

Example of the table :

uuid  |status     |date         
––––––|–––––––––––|––––––––––      
3     |'inactive' |2018-05-12
1     |'active'   |2018-05-10
1     |'inactive' |2018-05-08
2     |'active'   |2018-05-08
3     |'active'   |2018-05-04
2     |'inactive' |2018-04-22
3     |'inactive' |2018-04-18

We can see that we have multiple value of each data.

What I would like to get:

I would like to have the number of current 'active' entry (So there must be no 'inactive' entry with the same uuid after). And to complicate everything, I need this total per day. So for each day, the amount of 'active' entries, including those from previous days.

So with this example I should have this result :

date        | actives
____________|_________
2018-05-02  |   0
2018-05-03  |   0
2018-05-04  |   1
2018-05-05  |   1
2018-05-06  |   1
2018-05-07  |   1
2018-05-08  |   2
2018-05-09  |   2
2018-05-10  |   3
2018-05-11  |   3
2018-05-12  |   2

Actually i've managed to get the good amount of actives for one day. But my problem is when i want the results for each days.

What I've tried:

I'm stuck with two solutions that each return a different error.

First solution :

WITH
  dates AS(
      SELECT GENERATE_DATE_ARRAY(
          DATE_SUB(CURRENT_DATE(), INTERVAL 6 MONTH), CURRENT_DATE(), INTERVAL 1 DAY)               
      arr_dates )
SELECT
  i_date date,
  (
  SELECT COUNT(uuid)
  FROM (
    SELECT
      uuid, status, date,
      RANK() OVER(PARTITION BY uuid ORDER BY date DESC) rank
    FROM users
    WHERE
      PARSE_DATE("%Y-%m-%d", FORMAT_DATETIME("%Y-%m-%d",date)) <= i_date
  )
  WHERE
    status = 'active'
    and rank = 1
    ## rank is the condition which causes the error
  ) users
FROM
  dates, UNNEST(arr_dates) i_date
ORDER BY i_date;

The SELECT with the RANK() OVER correctly returns the users with a rank column that allow me to know which entry is the last for each uuid. But when I try this, I got a : Correlated subqueries that reference other tables are not supported unless they can be de-correlated, such as by transforming them into an efficient JOIN. because of the rank = 1 condition.

Second solution :

WITH
  dates AS(
      SELECT GENERATE_DATE_ARRAY(
          DATE_SUB(CURRENT_DATE(), INTERVAL 6 MONTH), CURRENT_DATE(), INTERVAL 1 DAY)               
      arr_dates )
SELECT
  i_date date,
  (
  SELECT
    COUNT(t1.uuid)
  FROM
    users t1
  WHERE
    t1.date = (
      SELECT MAX(t2.date)
      FROM users t2
      WHERE
        t2.uuid = t1.uuid
        ## Here that's the i_date condition which causes problem 
        AND PARSE_DATE("%Y-%m-%d", FORMAT_DATETIME("%Y-%m-%d", t2.date)) <= i_date 
    )
    AND status='active' ) users
FROM
  dates,
  UNNEST(arr_dates) i_date
ORDER BY i_date;

Here, the second select is working too and correctly returning the number of active user for a current day. But the problem is when i try to use i_date to retrieve datas among the multiple days. And Here i got a LEFT OUTER JOIN cannot be used without a condition that is an equality of fields from both sides of the join. error...

Which solution is more able to succeed ? What should i change ?

And, if my way of storing the data isn't good, how should i proceed in order to keep a precise history ?

Mikhail Berlyant · Accepted Answer

Below is for BigQuery Standard SQL

#standardSQL
SELECT date, COUNT(DISTINCT uuid) total_active 
FROM `project.dataset.table`
WHERE status = 'active'
GROUP BY date 
-- ORDER BY date

Update to address your "rephrased" question :o)
Below example is using dummy data from your question

#standardSQL
WITH `project.dataset.users` AS (
  SELECT 3 uuid, 'inactive' status, DATE '2018-05-12' date UNION ALL
  SELECT 1, 'active', '2018-05-10' UNION ALL
  SELECT 1, 'inactive', '2018-05-08' UNION ALL
  SELECT 2, 'active', '2018-05-08' UNION ALL
  SELECT 3, 'active', '2018-05-04' UNION ALL
  SELECT 2, 'inactive', '2018-04-22' UNION ALL
  SELECT 3, 'inactive', '2018-04-18' 
), dates AS (
  SELECT day FROM UNNEST((
    SELECT GENERATE_DATE_ARRAY(MIN(date), MAX(date))
    FROM `project.dataset.users`
  )) day
), active_users AS (
  SELECT uuid, status, date first, DATE_SUB(next_status.date, INTERVAL 1 DAY) last FROM (
    SELECT uuid, date, status, LEAD(STRUCT(status, date)) OVER(PARTITION BY uuid ORDER BY date ) next_status
    FROM `project.dataset.users` u
  )
  WHERE status = 'active'
)
SELECT day, COUNT(DISTINCT uuid) actives
FROM dates d JOIN active_users u
ON day BETWEEN first AND IFNULL(last, day)
GROUP BY day 
-- ORDER BY day

with result

Row day         actives  
1   2018-05-04  1    
2   2018-05-05  1    
3   2018-05-06  1    
4   2018-05-07  1    
5   2018-05-08  2    
6   2018-05-09  2    
7   2018-05-10  3    
8   2018-05-11  3    
9   2018-05-12  2

Google Big Query SQL - Get most recent unique value by date

Answers (2)

Related Questions