RobinToppan
RobinToppan

Reputation: 33

Google Big Query SQL - Get most recent unique value by date

#EDIT - Following the comments, I rephrase my question

I have a BigQuery table that i want to use to get some KPI of my application. In this table, I save each create or update as a new line in order to keep a better history. So I have several times the same data with a different state.

Example of the table :

uuid  |status     |date         
––––––|–––––––––––|––––––––––      
3     |'inactive' |2018-05-12
1     |'active'   |2018-05-10
1     |'inactive' |2018-05-08
2     |'active'   |2018-05-08
3     |'active'   |2018-05-04
2     |'inactive' |2018-04-22
3     |'inactive' |2018-04-18

We can see that we have multiple value of each data.

What I would like to get:

I would like to have the number of current 'active' entry (So there must be no 'inactive' entry with the same uuid after). And to complicate everything, I need this total per day. So for each day, the amount of 'active' entries, including those from previous days.

So with this example I should have this result :

date        | actives
____________|_________
2018-05-02  |   0
2018-05-03  |   0
2018-05-04  |   1
2018-05-05  |   1
2018-05-06  |   1
2018-05-07  |   1
2018-05-08  |   2
2018-05-09  |   2
2018-05-10  |   3
2018-05-11  |   3
2018-05-12  |   2

Actually i've managed to get the good amount of actives for one day. But my problem is when i want the results for each days.

What I've tried:

I'm stuck with two solutions that each return a different error.

First solution :

WITH
  dates AS(
      SELECT GENERATE_DATE_ARRAY(
          DATE_SUB(CURRENT_DATE(), INTERVAL 6 MONTH), CURRENT_DATE(), INTERVAL 1 DAY)               
      arr_dates )
SELECT
  i_date date,
  (
  SELECT COUNT(uuid)
  FROM (
    SELECT
      uuid, status, date,
      RANK() OVER(PARTITION BY uuid ORDER BY date DESC) rank
    FROM users
    WHERE
      PARSE_DATE("%Y-%m-%d", FORMAT_DATETIME("%Y-%m-%d",date)) <= i_date
  )
  WHERE
    status = 'active'
    and rank = 1
    ## rank is the condition which causes the error
  ) users
FROM
  dates, UNNEST(arr_dates) i_date
ORDER BY i_date;

The SELECT with the RANK() OVER correctly returns the users with a rank column that allow me to know which entry is the last for each uuid. But when I try this, I got a : Correlated subqueries that reference other tables are not supported unless they can be de-correlated, such as by transforming them into an efficient JOIN. because of the rank = 1 condition.

Second solution :

WITH
  dates AS(
      SELECT GENERATE_DATE_ARRAY(
          DATE_SUB(CURRENT_DATE(), INTERVAL 6 MONTH), CURRENT_DATE(), INTERVAL 1 DAY)               
      arr_dates )
SELECT
  i_date date,
  (
  SELECT
    COUNT(t1.uuid)
  FROM
    users t1
  WHERE
    t1.date = (
      SELECT MAX(t2.date)
      FROM users t2
      WHERE
        t2.uuid = t1.uuid
        ## Here that's the i_date condition which causes problem 
        AND PARSE_DATE("%Y-%m-%d", FORMAT_DATETIME("%Y-%m-%d", t2.date)) <= i_date 
    )
    AND status='active' ) users
FROM
  dates,
  UNNEST(arr_dates) i_date
ORDER BY i_date;

Here, the second select is working too and correctly returning the number of active user for a current day. But the problem is when i try to use i_date to retrieve datas among the multiple days. And Here i got a LEFT OUTER JOIN cannot be used without a condition that is an equality of fields from both sides of the join. error...

Which solution is more able to succeed ? What should i change ?

And, if my way of storing the data isn't good, how should i proceed in order to keep a precise history ?

Upvotes: 3

Views: 2793

Answers (2)

Mikhail Berlyant
Mikhail Berlyant

Reputation: 173190

Below is for BigQuery Standard SQL

#standardSQL
SELECT date, COUNT(DISTINCT uuid) total_active 
FROM `project.dataset.table`
WHERE status = 'active'
GROUP BY date 
-- ORDER BY date   

Update to address your "rephrased" question :o)
Below example is using dummy data from your question

#standardSQL
WITH `project.dataset.users` AS (
  SELECT 3 uuid, 'inactive' status, DATE '2018-05-12' date UNION ALL
  SELECT 1, 'active', '2018-05-10' UNION ALL
  SELECT 1, 'inactive', '2018-05-08' UNION ALL
  SELECT 2, 'active', '2018-05-08' UNION ALL
  SELECT 3, 'active', '2018-05-04' UNION ALL
  SELECT 2, 'inactive', '2018-04-22' UNION ALL
  SELECT 3, 'inactive', '2018-04-18' 
), dates AS (
  SELECT day FROM UNNEST((
    SELECT GENERATE_DATE_ARRAY(MIN(date), MAX(date))
    FROM `project.dataset.users`
  )) day
), active_users AS (
  SELECT uuid, status, date first, DATE_SUB(next_status.date, INTERVAL 1 DAY) last FROM (
    SELECT uuid, date, status, LEAD(STRUCT(status, date)) OVER(PARTITION BY uuid ORDER BY date ) next_status
    FROM `project.dataset.users` u
  )
  WHERE status = 'active'
)
SELECT day, COUNT(DISTINCT uuid) actives
FROM dates d JOIN active_users u
ON day BETWEEN first AND IFNULL(last, day)
GROUP BY day 
-- ORDER BY day

with result

Row day         actives  
1   2018-05-04  1    
2   2018-05-05  1    
3   2018-05-06  1    
4   2018-05-07  1    
5   2018-05-08  2    
6   2018-05-09  2    
7   2018-05-10  3    
8   2018-05-11  3    
9   2018-05-12  2    

Upvotes: 3

Gordon Linoff
Gordon Linoff

Reputation: 1271121

I think this -- or something similar -- will do what you want:

SELECT day,
       coalesce(running_actives, 0) - coalesce(running_inactives, 0)
FROM UNNEST(GENERATE_DATE_ARRAY(DATE('2015-05-11'), DATE('2018-06-29'), INTERVAL 1 DAY)
           ) AS day left join
     (select date, sum(countif(status = 'active')) over (order by date) as running_actives,
             sum(countif(status = 'active')) over (order by date) as running_inactives
      from t
      group by date
     ) a
     on a.date = day
order by day;

The exact solution depends on whether the "inactive" is inclusive of the day (as above) or takes effect the next day. Either is handled the same way, by using cumulative sums of actives and inactives and then taking the difference.

In order to get data on all days, this generates the days using arrays and unnest(). If you have data on all days, that step may be unnecessary

Upvotes: 0

Related Questions