R.D
R.D

Reputation: 4871

How to do this data transformation

This is my input data

GroupId Serial Action
1        1      Start
1        2      Run
1        3      Jump
1        8      End
2        9      Shop
2        10     Start
2        11     Run

For each activitysequence in a group I want to Find pairs of Actions where Action1.SerialNo = Action2.SerialNo + k and how may times it happens

Suppose k  = 1, then output will be

FirstAction  NextAction Frequency
Start Run 2
Run Jump  1
Shop Start 1

How can I do this in SQL, fast enough given the input table contains millions of entries.

Upvotes: 0

Views: 128

Answers (3)

Steve Kass
Steve Kass

Reputation: 7184

tful, This should produce the result you want, but I don't know if it will be as fast as you 'd like. It's worth a try.

create table Actions(
  GroupId int,
  Serial int,
  "Action" varchar(20) not null,
  primary key (GroupId, Serial)
);

insert into Actions values
  (1,1,'Start'), (1,2,'Run'), (1,3,'Jump'),
  (1,8,'End'), (2,9,'Shop'), (2,10,'Start'),
  (2,11,'Run');
go

declare @k int = 1;
with ActionsDoubled(Serial,Tag,"Action") as (
  select
    Serial, 'a', "Action"
  from Actions as A
  union all
  select
    Serial-@k, 'b', "Action"
  from Actions
  as B
), Pivoted(Serial,a,b) as (
  select Serial,a,b
  from ActionsDoubled
  pivot (
    max("Action") for Tag in ([a],[b])
  ) as P
)
  select 
    a, b, count(*) as ct
    from Pivoted
    where a is not NULL and b is not NULL
    group by a,b
    order by a,b;
go

drop table Actions;

If you will be doing the same computation for various @k values on stable data, this may work better in the long run:

declare @k int = 1;
  select 
    Serial, 'a' as Tag, "Action"
  into ActionsDoubled
  from Actions as A
  union all
  select
    Serial-@k, 'b', "Action"
  from Actions
  as B;
go

create unique clustered index AD_S on ActionsDoubled(Serial,Tag);
create index AD_a on ActionsDoubled(Tag,Serial);
go

with Pivoted(Serial,a,b) as (
  select Serial,a,b
  from ActionsDoubled
  pivot (
    max("Action") for Tag in ([a],[b])
  ) as P
)
  select 
    a, b, count(*) as ct
    from Pivoted
    where a is not NULL and b is not NULL
    group by a,b
    order by a,b;
go

drop table ActionsDoubled;

Upvotes: 1

hythlodayr
hythlodayr

Reputation: 2387

The problem is this: Your query has to go through EVERY row regardless.

You can make it more manageable for your database by tackling each group separately as separate queries. Especially if the size of each group is SMALL.

There's a lot going on under the hood and when the query has to do a scan of the entire table, this actually ends up being many times slower than if you did small chunks which effectively cover all million rows.

So for instance:

--Stickler for clean formatting...
SELECT 
    a1.Action AS FirstAction, 
    a2.Action AS NextAction,
    COUNT(*) AS Frequency
FROM 
     Activities a1 JOIN Activities a2
     ON (a1.groupid = a2.groupid 
         AND a1.Serial = a2.Serial + @k)   
WHERE
     a1.groupid = 1
GROUP BY
     a1.Action,
     a2.Action;

By the way, you have an index (GroupId, Serial) on the table, right?

Upvotes: 0

Bill Karwin
Bill Karwin

Reputation: 562961

SELECT a1.Action AS FirstActio, a2.Action AS NextAction, COUNT(*) AS Frequency
FROM Activities a1 JOIN Activities a2
 ON (a1.GroupId = a2.GroupId AND a1.Serial = a2.Serial + @k)
GROUP BY a1.Action, a2.Action;

Upvotes: 0

Related Questions