How to determine the number of distinct events in a sliding window using Oracle

I have the task to determine fact of three events on different accounts is in 1 hour window.

The solution could be like

count(distinct account_id) over (order by time_key range between 20 PRECEDING and CURRENT ROW)

and check that count() > 3

But Oracle can't use distinct function with order by clause:

ORA-30487: ORDER BY not allowed here

I have the solution below, but it seems hard

with t_data as (
select 1 as account_id, 1000 as time_key from dual union
select 1 as account_id, 1010 as time_key from dual union
select 1 as account_id, 1020 as time_key from dual union
select 1 as account_id, 1030 as time_key from dual union
select 2 as account_id, 1040 as time_key from dual union
select 3 as account_id, 1050 as time_key from dual union
select 3 as account_id, 1060 as time_key from dual union
select 3 as account_id, 1070 as time_key from dual union
select 3 as account_id, 1080 as time_key from dual union
select 3 as account_id, 1090 as time_key from dual
order by time_key
)

select *
from (
  select  account_id,
          time_key,
          max(
              case 
               when account_id = 1 then 1
               else 0
              end
          ) over (order by time_key range between 20 PRECEDING and CURRENT ROW) as m1,
          max(
              case 
               when account_id = 2 then 1
               else 0
              end
          ) over (order by time_key range between 20 PRECEDING and CURRENT ROW) as m2,
          max(
              case 
               when account_id = 3 then 1
               else 0
              end
          ) over (order by time_key range between 20 PRECEDING and CURRENT ROW) as m3
  from t_data
)
where m1 = 1 and m2 = 1 and m3 = 1

What is the simpler way to determine number of distinct events in a sliding window ?

Upvotes: 4

Views: 285

Answers (3)

Matthew McPeak
Matthew McPeak

Reputation: 17944

If you are really hell-bent on using a only a single windowing clause, here is a way:

with product_of_primes as (
select t.*, round(exp(sum(ln(decode(account_id,1,2,2,3,3,5))) 
       over ( order by time_key range between 20 preceding
                   and current row ))) product from t_data t
)
select account_id, time_key from product_of_primes
where mod(product,2*3*5) = 0;

Explanation:

  • Convert each distinct account_id into a prime number. So, the 1st account_id gets 2, the next gets 3, the next gets 5.
  • Take the natural log of that number
  • Sum the natural logs for all the events in the last hour (i.e., in our window), remembering that ln(a)+ln(b) = ln(a*b)
  • Take e to the power of the sum
  • (So far, this is just a long winded way to multiply all the prime numbers we mapped our account_ids to)
  • Any row where this result is evenly divisible by all three prime numbers we used (2,3,5 -- so, that is divisible by 30) has all three distinct account_ids in it's window.

If you were on my team and you wrote this, I would kill you.

Full example with data:

with t_data as (
select 1 as account_id, 1000 as time_key from dual union
select 1 as account_id, 1010 as time_key from dual union
select 1 as account_id, 1020 as time_key from dual union
select 1 as account_id, 1030 as time_key from dual union
select 2 as account_id, 1040 as time_key from dual union
select 3 as account_id, 1050 as time_key from dual union
select 3 as account_id, 1060 as time_key from dual union
select 3 as account_id, 1070 as time_key from dual union
select 3 as account_id, 1080 as time_key from dual union
select 3 as account_id, 1090 as time_key from dual
order by time_key
),
product_of_primes as (
select t.*, round(exp(sum(ln(decode(account_id,1,2,2,3,3,5))) 
        over ( order by time_key range between 20 preceding 
               and current row ))) product from t_data t
)
select account_id, time_key from product_of_primes
where mod(product,2*3*5) = 0;

Results:

+------------+----------+---------+
| ACCOUNT_ID | TIME_KEY | PRODUCT |
+------------+----------+---------+
|          3 |     1050 |      30 |
+------------+----------+---------+

Upvotes: 1

Matthew McPeak
Matthew McPeak

Reputation: 17944

Here is a super-simple way to do it. We can work on the performance, maybe if you want to post some details about the size of your table.

select t1.account_id, t1.time_key, count(distinct t2.account_id) cnt
from t_data t1 cross join t_data t2
where t2.time_key between t1.time_key - 20 and t1.time_key
group by t1.account_id, t1.time_key
having count(distinct t2.account_id) >= 3;

Upvotes: 1

Gordon Linoff
Gordon Linoff

Reputation: 1270593

It is not immediately obvious to me how you do this with window functions. You can use a correlated subquery:

select t.*,
       (select count(distinct t2.account_id)
        from t_data t2
        where t2.time_key >= t.time_key - 20 and t2.time_key <= t.time_key
       )
from t_data t;

Another method -- which could conceivably have better performance -- is to treat the problem as a gaps-and-island problem. The following version returns the number of simultaneous distinct accounts at each time key:

with t as (
      select account_id, min(time_key) as min_time_key, max(time_key + 20) as max_time_key
      from (select t.*, sum(case when time_key - prev_time_key <= 20 then 0 else 1 end) over (order by time_key) as grp
            from (select t.*, lag(time_key) over (partition by account_id order by time_key) as prev_time_key
                  from t_data t
                 ) t
           ) t
      group by account_id
     )
select td.account_id, td.time_key, count(distinct t.account_id) as num_distinct
from t_data td join
     t
     on td.time_key between t.min_time_key and t.max_time_key
group by td.account_id, td.time_key;

Finally, if you have only 3 (or 2) account ids that you want to find and you only care about getting some examples where the max is hit, then you can do the following:

select t.*
from (select t.*,
             min(account_id) over (order by time_key range between 20 preceding and 1 preceding) as min_account_id,
             max(account_id) over (order by time_key range between 20 preceding and 1 preceding) as max_account_id
      from t_data t
     ) t
where min_account_id <> max_account_id and
      account_id <> min_account_id and
      account_id <> max_account_id;

This gets the max and min account ids from the preceding 20 rows -- excluding the current row. If these are different from the current value, then you have three different values.

Upvotes: 1

Related Questions