Reputation: 43
I have the task to determine fact of three events on different accounts is in 1 hour window.
The solution could be like
count(distinct account_id) over (order by time_key range between 20 PRECEDING and CURRENT ROW)
and check that count() > 3
But Oracle can't use distinct function with order by clause:
ORA-30487: ORDER BY not allowed here
I have the solution below, but it seems hard
with t_data as (
select 1 as account_id, 1000 as time_key from dual union
select 1 as account_id, 1010 as time_key from dual union
select 1 as account_id, 1020 as time_key from dual union
select 1 as account_id, 1030 as time_key from dual union
select 2 as account_id, 1040 as time_key from dual union
select 3 as account_id, 1050 as time_key from dual union
select 3 as account_id, 1060 as time_key from dual union
select 3 as account_id, 1070 as time_key from dual union
select 3 as account_id, 1080 as time_key from dual union
select 3 as account_id, 1090 as time_key from dual
order by time_key
)
select *
from (
select account_id,
time_key,
max(
case
when account_id = 1 then 1
else 0
end
) over (order by time_key range between 20 PRECEDING and CURRENT ROW) as m1,
max(
case
when account_id = 2 then 1
else 0
end
) over (order by time_key range between 20 PRECEDING and CURRENT ROW) as m2,
max(
case
when account_id = 3 then 1
else 0
end
) over (order by time_key range between 20 PRECEDING and CURRENT ROW) as m3
from t_data
)
where m1 = 1 and m2 = 1 and m3 = 1
What is the simpler way to determine number of distinct events in a sliding window ?
Upvotes: 4
Views: 285
Reputation: 17944
If you are really hell-bent on using a only a single windowing clause, here is a way:
with product_of_primes as (
select t.*, round(exp(sum(ln(decode(account_id,1,2,2,3,3,5)))
over ( order by time_key range between 20 preceding
and current row ))) product from t_data t
)
select account_id, time_key from product_of_primes
where mod(product,2*3*5) = 0;
Explanation:
If you were on my team and you wrote this, I would kill you.
Full example with data:
with t_data as (
select 1 as account_id, 1000 as time_key from dual union
select 1 as account_id, 1010 as time_key from dual union
select 1 as account_id, 1020 as time_key from dual union
select 1 as account_id, 1030 as time_key from dual union
select 2 as account_id, 1040 as time_key from dual union
select 3 as account_id, 1050 as time_key from dual union
select 3 as account_id, 1060 as time_key from dual union
select 3 as account_id, 1070 as time_key from dual union
select 3 as account_id, 1080 as time_key from dual union
select 3 as account_id, 1090 as time_key from dual
order by time_key
),
product_of_primes as (
select t.*, round(exp(sum(ln(decode(account_id,1,2,2,3,3,5)))
over ( order by time_key range between 20 preceding
and current row ))) product from t_data t
)
select account_id, time_key from product_of_primes
where mod(product,2*3*5) = 0;
Results:
+------------+----------+---------+
| ACCOUNT_ID | TIME_KEY | PRODUCT |
+------------+----------+---------+
| 3 | 1050 | 30 |
+------------+----------+---------+
Upvotes: 1
Reputation: 17944
Here is a super-simple way to do it. We can work on the performance, maybe if you want to post some details about the size of your table.
select t1.account_id, t1.time_key, count(distinct t2.account_id) cnt
from t_data t1 cross join t_data t2
where t2.time_key between t1.time_key - 20 and t1.time_key
group by t1.account_id, t1.time_key
having count(distinct t2.account_id) >= 3;
Upvotes: 1
Reputation: 1270593
It is not immediately obvious to me how you do this with window functions. You can use a correlated subquery:
select t.*,
(select count(distinct t2.account_id)
from t_data t2
where t2.time_key >= t.time_key - 20 and t2.time_key <= t.time_key
)
from t_data t;
Another method -- which could conceivably have better performance -- is to treat the problem as a gaps-and-island problem. The following version returns the number of simultaneous distinct accounts at each time key:
with t as (
select account_id, min(time_key) as min_time_key, max(time_key + 20) as max_time_key
from (select t.*, sum(case when time_key - prev_time_key <= 20 then 0 else 1 end) over (order by time_key) as grp
from (select t.*, lag(time_key) over (partition by account_id order by time_key) as prev_time_key
from t_data t
) t
) t
group by account_id
)
select td.account_id, td.time_key, count(distinct t.account_id) as num_distinct
from t_data td join
t
on td.time_key between t.min_time_key and t.max_time_key
group by td.account_id, td.time_key;
Finally, if you have only 3 (or 2) account ids that you want to find and you only care about getting some examples where the max is hit, then you can do the following:
select t.*
from (select t.*,
min(account_id) over (order by time_key range between 20 preceding and 1 preceding) as min_account_id,
max(account_id) over (order by time_key range between 20 preceding and 1 preceding) as max_account_id
from t_data t
) t
where min_account_id <> max_account_id and
account_id <> min_account_id and
account_id <> max_account_id;
This gets the max and min account ids from the preceding 20 rows -- excluding the current row. If these are different from the current value, then you have three different values.
Upvotes: 1