Reputation: 95
I've got two tables: Log(id,user,action,date) and ActionTypes(action,type). Given an action A0, and a type T0, I would like to count for each user, how many times she has used each other action Ai right after A0, but skipping the actions of Log which are not of type T0. So for example:
Log:
id user action date
----------------------------------------
1 mary start 2012-07-16 08:00:00
2 mary open 2012-07-16 09:00:00
3 john start 2012-07-16 09:00:00
4 mary play 2012-07-16 10:00:00
5 john open 2012-07-16 10:30:00
6 mary start 2012-07-16 11:00:00
7 mary jump 2012-07-16 12:00:00
8 mary close 2012-07-16 13:00:00
9 mary delete 2012-07-16 14:00:00
10 mary start 2012-07-16 15:00:00
11 mary open 2012-07-16 16:00:00
ActionTypes:
action type
--------------
start 0
open 1
play 1
jump 2
close 1
delete 1
So, given the action 'start' and the type '1' the answer will be:
user action ntimes
------------------------
mary open 2
mary close 1
john open 1
My attempt is
SELECT b.user,b.action, count(*)
FROM log a, log b
WHERE a.action='start' AND b.date>a.date AND a.user=b.user AND
1=(select type from ActionTypes where action=b.action) AND
not exists (SELECT c.action FROM log c where c.user=a.user AND
c.date>a.date and c.date<b.date and
1=(select type from ActionTypes where action=c.action))
GROUP BY b.user,b.action
Our Log table has about 1 million tuples and the query works, but it is too slow. We're using SQLServer. Any hints on how to make it faster? Thanks
Upvotes: 4
Views: 1645
Reputation: 77707
After borrowing @Nikola Markovinović's setup, I came up with the following solution:
WITH ranked AS (
SELECT
L1.[user],
L2.action,
rnk = ROW_NUMBER() OVER (PARTITION BY L1.id ORDER BY L2.date)
FROM Log L1
INNER JOIN Log L2 ON L2.[user] = L1.[user] AND L2.date > L1.date
INNER JOIN ActionType at ON L2.action = at.action
WHERE L1.action = @Action
AND at.type = @Type
)
SELECT
[user],
action,
ntimes = COUNT(*)
FROM ranked
WHERE rnk = 1
GROUP BY
[user],
action
;
Basically, this query selects from the Log
table all users' records that have the specified action, then joins that subset back to Log
to retrieve all actions of the specified type that follow those in the first subset, ranking them in the ascending order of date
along the way (using the ROW_NUMBER()
function). Then the query retrieves only rows with the rankings of 1
, groups them by user
and action
and counts rows in the groups.
You can see (and play with) a working example at SQL Fiddle.
Upvotes: 3
Reputation: 19356
Could you please give this query a try? It uses exists to test if previous chronological record is of requested type. I believe that it would be faster than self-join. I have put a demo @ Sql Fiddle.
select log.[user], log.action, count(*) ntimes
from log
inner join actiontype t
on log.action = t.action
where t.type = 1
and exists (select *
from
(select top 1 t1.type
from log l1
inner join actiontype t1
on l1.action = t1.action
where l1.[user] = log.[user]
and l1.date < log.date
and t1.type in (0, 1)
order by l1.date desc
) prevEntry
where prevEntry.type = 0
)
group by log.[user], log.action
I don't understand why is mary
\ close
in result list. Previous record is jump which is of type 2
, and should not be skipped over to get to start.
Upvotes: 3
Reputation: 3259
Your action queries and all of the relation fields it is much faster to be an integer and not a string.
The only way to do your queries faster is to change the structure of the database. The relations has to be indexed and has to be integers and not strings. For example something like this:
id user action date
----------------------------------------
1 mary 1 2012-07-16 08:00:00
2 mary 2 2012-07-16 09:00:00
3 john 3 2012-07-16 09:00:00
4 mary 1 2012-07-16 10:00:00
5 john 3 2012-07-16 10:30:00
6 mary 4 2012-07-16 11:00:00
7 mary 5 2012-07-16 12:00:00
8 mary 6 2012-07-16 13:00:00
9 mary 1 2012-07-16 14:00:00
10 mary 3 2012-07-16 15:00:00
11 mary 1 2012-07-16 16:00:00
will solve your problem.
In addition if you have from 1-9 action types you can have the action to tinyint and also if you add an id and tinyint with primary key will definetely do your queries easier (with simple joins) and will also your database it will be more flexible for future changes . For example you can have:
id action type
--------------
1 start 0
2 open 1
3 play 1
4 jump 2
5 close 1
6 delete 1
Where id is a primary key and the "action" at "Log" table has a foreign key to this id.
I think the main problem is that you don't have indexes and foreign keys relations.
Upvotes: 2
Reputation: 7038
I sightly disagree with statements:
...is much faster to be an integer and not a string
That's not exactly true, once column action
is indexed there is little difference between integers or strings.
...only way to do your queries faster is to change the structure of the database
In this case query could be optimized in number of ways:
where
condition before group by
). Here is simplified query which will produce result you need(it's written&tested on Oracle since it's been a while since I was working with ms sql server ):
select "user", action, count(*) from action_log where action not in ( --exclusion criteria select action_type."action"from action_type where action_type."type" = 1 ) group by "user", action
Upvotes: 0