Reputation: 43

How to determine the number of distinct events in a sliding window using Oracle

I have the task to determine fact of three events on different accounts is in 1 hour window.

The solution could be like

count(distinct account_id) over (order by time_key range between 20 PRECEDING and CURRENT ROW)

and check that count() > 3

But Oracle can't use distinct function with order by clause:

ORA-30487: ORDER BY not allowed here

I have the solution below, but it seems hard

with t_data as (
select 1 as account_id, 1000 as time_key from dual union
select 1 as account_id, 1010 as time_key from dual union
select 1 as account_id, 1020 as time_key from dual union
select 1 as account_id, 1030 as time_key from dual union
select 2 as account_id, 1040 as time_key from dual union
select 3 as account_id, 1050 as time_key from dual union
select 3 as account_id, 1060 as time_key from dual union
select 3 as account_id, 1070 as time_key from dual union
select 3 as account_id, 1080 as time_key from dual union
select 3 as account_id, 1090 as time_key from dual
order by time_key
)

select *
from (
  select  account_id,
          time_key,
          max(
              case 
               when account_id = 1 then 1
               else 0
              end
          ) over (order by time_key range between 20 PRECEDING and CURRENT ROW) as m1,
          max(
              case 
               when account_id = 2 then 1
               else 0
              end
          ) over (order by time_key range between 20 PRECEDING and CURRENT ROW) as m2,
          max(
              case 
               when account_id = 3 then 1
               else 0
              end
          ) over (order by time_key range between 20 PRECEDING and CURRENT ROW) as m3
  from t_data
)
where m1 = 1 and m2 = 1 and m3 = 1

What is the simpler way to determine number of distinct events in a sliding window ?

Upvotes: 4

Answers (3)

Matthew McPeak

Reputation: 17944

If you are really hell-bent on using a only a single windowing clause, here is a way:

with product_of_primes as (
select t.*, round(exp(sum(ln(decode(account_id,1,2,2,3,3,5))) 
       over ( order by time_key range between 20 preceding
                   and current row ))) product from t_data t
)
select account_id, time_key from product_of_primes
where mod(product,2*3*5) = 0;

Explanation:

Convert each distinct account_id into a prime number. So, the 1st account_id gets 2, the next gets 3, the next gets 5.
Take the natural log of that number
Sum the natural logs for all the events in the last hour (i.e., in our window), remembering that ln(a)+ln(b) = ln(a*b)
Take e to the power of the sum
(So far, this is just a long winded way to multiply all the prime numbers we mapped our account_ids to)
Any row where this result is evenly divisible by all three prime numbers we used (2,3,5 -- so, that is divisible by 30) has all three distinct account_ids in it's window.

If you were on my team and you wrote this, I would kill you.

Full example with data:

with t_data as (
select 1 as account_id, 1000 as time_key from dual union
select 1 as account_id, 1010 as time_key from dual union
select 1 as account_id, 1020 as time_key from dual union
select 1 as account_id, 1030 as time_key from dual union
select 2 as account_id, 1040 as time_key from dual union
select 3 as account_id, 1050 as time_key from dual union
select 3 as account_id, 1060 as time_key from dual union
select 3 as account_id, 1070 as time_key from dual union
select 3 as account_id, 1080 as time_key from dual union
select 3 as account_id, 1090 as time_key from dual
order by time_key
),
product_of_primes as (
select t.*, round(exp(sum(ln(decode(account_id,1,2,2,3,3,5))) 
        over ( order by time_key range between 20 preceding 
               and current row ))) product from t_data t
)
select account_id, time_key from product_of_primes
where mod(product,2*3*5) = 0;

Results:

+------------+----------+---------+
| ACCOUNT_ID | TIME_KEY | PRODUCT |
+------------+----------+---------+
|          3 |     1050 |      30 |
+------------+----------+---------+

Upvotes: 1

Matthew McPeak

Reputation: 17944

Here is a super-simple way to do it. We can work on the performance, maybe if you want to post some details about the size of your table.

select t1.account_id, t1.time_key, count(distinct t2.account_id) cnt
from t_data t1 cross join t_data t2
where t2.time_key between t1.time_key - 20 and t1.time_key
group by t1.account_id, t1.time_key
having count(distinct t2.account_id) >= 3;

Upvotes: 1

Gordon Linoff

Reputation: 1270593

It is not immediately obvious to me how you do this with window functions. You can use a correlated subquery:

select t.*,
       (select count(distinct t2.account_id)
        from t_data t2
        where t2.time_key >= t.time_key - 20 and t2.time_key <= t.time_key
       )
from t_data t;

Another method -- which could conceivably have better performance -- is to treat the problem as a gaps-and-island problem. The following version returns the number of simultaneous distinct accounts at each time key:

with t as (
      select account_id, min(time_key) as min_time_key, max(time_key + 20) as max_time_key
      from (select t.*, sum(case when time_key - prev_time_key <= 20 then 0 else 1 end) over (order by time_key) as grp
            from (select t.*, lag(time_key) over (partition by account_id order by time_key) as prev_time_key
                  from t_data t
                 ) t
           ) t
      group by account_id
     )
select td.account_id, td.time_key, count(distinct t.account_id) as num_distinct
from t_data td join
     t
     on td.time_key between t.min_time_key and t.max_time_key
group by td.account_id, td.time_key;

Finally, if you have only 3 (or 2) account ids that you want to find and you only care about getting some examples where the max is hit, then you can do the following:

select t.*
from (select t.*,
             min(account_id) over (order by time_key range between 20 preceding and 1 preceding) as min_account_id,
             max(account_id) over (order by time_key range between 20 preceding and 1 preceding) as max_account_id
      from t_data t
     ) t
where min_account_id <> max_account_id and
      account_id <> min_account_id and
      account_id <> max_account_id;

This gets the max and min account ids from the preceding 20 rows -- excluding the current row. If these are different from the current value, then you have three different values.

Upvotes: 1

How to determine the number of distinct events in a sliding window using Oracle

Answers (3)

Related Questions