Reputation: 4871
This is my input data
GroupId Serial Action
1 1 Start
1 2 Run
1 3 Jump
1 8 End
2 9 Shop
2 10 Start
2 11 Run
For each activitysequence in a group I want to Find pairs of Actions where Action1.SerialNo = Action2.SerialNo + k and how may times it happens
Suppose k = 1, then output will be
FirstAction NextAction Frequency
Start Run 2
Run Jump 1
Shop Start 1
How can I do this in SQL, fast enough given the input table contains millions of entries.
Upvotes: 0
Views: 128
Reputation: 7184
tful, This should produce the result you want, but I don't know if it will be as fast as you 'd like. It's worth a try.
create table Actions(
GroupId int,
Serial int,
"Action" varchar(20) not null,
primary key (GroupId, Serial)
);
insert into Actions values
(1,1,'Start'), (1,2,'Run'), (1,3,'Jump'),
(1,8,'End'), (2,9,'Shop'), (2,10,'Start'),
(2,11,'Run');
go
declare @k int = 1;
with ActionsDoubled(Serial,Tag,"Action") as (
select
Serial, 'a', "Action"
from Actions as A
union all
select
Serial-@k, 'b', "Action"
from Actions
as B
), Pivoted(Serial,a,b) as (
select Serial,a,b
from ActionsDoubled
pivot (
max("Action") for Tag in ([a],[b])
) as P
)
select
a, b, count(*) as ct
from Pivoted
where a is not NULL and b is not NULL
group by a,b
order by a,b;
go
drop table Actions;
If you will be doing the same computation for various @k values on stable data, this may work better in the long run:
declare @k int = 1;
select
Serial, 'a' as Tag, "Action"
into ActionsDoubled
from Actions as A
union all
select
Serial-@k, 'b', "Action"
from Actions
as B;
go
create unique clustered index AD_S on ActionsDoubled(Serial,Tag);
create index AD_a on ActionsDoubled(Tag,Serial);
go
with Pivoted(Serial,a,b) as (
select Serial,a,b
from ActionsDoubled
pivot (
max("Action") for Tag in ([a],[b])
) as P
)
select
a, b, count(*) as ct
from Pivoted
where a is not NULL and b is not NULL
group by a,b
order by a,b;
go
drop table ActionsDoubled;
Upvotes: 1
Reputation: 2387
The problem is this: Your query has to go through EVERY row regardless.
You can make it more manageable for your database by tackling each group separately as separate queries. Especially if the size of each group is SMALL.
There's a lot going on under the hood and when the query has to do a scan of the entire table, this actually ends up being many times slower than if you did small chunks which effectively cover all million rows.
So for instance:
--Stickler for clean formatting...
SELECT
a1.Action AS FirstAction,
a2.Action AS NextAction,
COUNT(*) AS Frequency
FROM
Activities a1 JOIN Activities a2
ON (a1.groupid = a2.groupid
AND a1.Serial = a2.Serial + @k)
WHERE
a1.groupid = 1
GROUP BY
a1.Action,
a2.Action;
By the way, you have an index (GroupId, Serial) on the table, right?
Upvotes: 0
Reputation: 562961
SELECT a1.Action AS FirstActio, a2.Action AS NextAction, COUNT(*) AS Frequency
FROM Activities a1 JOIN Activities a2
ON (a1.GroupId = a2.GroupId AND a1.Serial = a2.Serial + @k)
GROUP BY a1.Action, a2.Action;
Upvotes: 0