SAL PIMIENTA

Reputation: 95

Improving performance in SQL with multiple tables

I've got two tables: Log(id, user, action, date) and ActionTypes(action, type). Given an action A0 and a type T0, I would like to count, for each user, how many times she has used each other action Ai right after A0, skipping the rows of Log whose actions are not of type T0. So for example:

Log:

id   user   action        date
----------------------------------------
1    mary   start   2012-07-16 08:00:00
2    mary   open    2012-07-16 09:00:00
3    john   start   2012-07-16 09:00:00
4    mary   play    2012-07-16 10:00:00
5    john   open    2012-07-16 10:30:00
6    mary   start   2012-07-16 11:00:00
7    mary   jump    2012-07-16 12:00:00
8    mary   close   2012-07-16 13:00:00
9    mary   delete  2012-07-16 14:00:00
10   mary   start   2012-07-16 15:00:00
11   mary   open    2012-07-16 16:00:00

ActionTypes:

action  type
--------------
start   0
open    1
play    1
jump    2
close   1
delete  1

So, given the action 'start' and the type '1', the answer will be:

user   action    ntimes
------------------------
mary   open      2
mary   close     1
john   open      1

My attempt is:

SELECT b.user, b.action, COUNT(*)
FROM log a, log b
WHERE a.action = 'start' AND b.date > a.date AND a.user = b.user
  AND 1 = (SELECT type FROM ActionTypes WHERE action = b.action)
  AND NOT EXISTS (SELECT c.action FROM log c
                  WHERE c.user = a.user AND c.date > a.date AND c.date < b.date
                    AND 1 = (SELECT type FROM ActionTypes WHERE action = c.action))
GROUP BY b.user, b.action

Our Log table has about 1 million tuples and the query works, but it is too slow. We're using SQL Server. Any hints on how to make it faster? Thanks
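The schema, sample data, and query above can be reproduced end-to-end with an in-memory SQLite database (standing in for SQL Server here; the query text runs unchanged):

```python
import sqlite3

# Sample data from the question, loaded into an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE log (id INTEGER, user TEXT, action TEXT, date TEXT)")
cur.execute("CREATE TABLE ActionTypes (action TEXT, type INTEGER)")
rows = [("mary", "start", "08:00"), ("mary", "open", "09:00"),
        ("john", "start", "09:00"), ("mary", "play", "10:00"),
        ("john", "open", "10:30"), ("mary", "start", "11:00"),
        ("mary", "jump", "12:00"), ("mary", "close", "13:00"),
        ("mary", "delete", "14:00"), ("mary", "start", "15:00"),
        ("mary", "open", "16:00")]
for i, (u, a, t) in enumerate(rows, start=1):
    cur.execute("INSERT INTO log VALUES (?, ?, ?, ?)",
                (i, u, a, "2012-07-16 " + t + ":00"))
cur.executemany("INSERT INTO ActionTypes VALUES (?, ?)",
                [("start", 0), ("open", 1), ("play", 1),
                 ("jump", 2), ("close", 1), ("delete", 1)])

# The self-join attempt: count, per user, the first type-1 action after each 'start'.
result = cur.execute("""
    SELECT b.user, b.action, COUNT(*) AS ntimes
    FROM log a, log b
    WHERE a.action = 'start' AND b.date > a.date AND a.user = b.user
      AND 1 = (SELECT type FROM ActionTypes WHERE action = b.action)
      AND NOT EXISTS (SELECT c.action FROM log c
                      WHERE c.user = a.user AND c.date > a.date AND c.date < b.date
                        AND 1 = (SELECT type FROM ActionTypes WHERE action = c.action))
    GROUP BY b.user, b.action
""").fetchall()
print(sorted(result))
# → [('john', 'open', 1), ('mary', 'close', 1), ('mary', 'open', 2)]
```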

Upvotes: 4

Views: 1645

Answers (4)

Andriy M

Reputation: 77707

After borrowing @Nikola Markovinović's setup, I came up with the following solution:

WITH ranked AS (
  SELECT
    L1.[user],
    L2.action,
    rnk = ROW_NUMBER() OVER (PARTITION BY L1.id ORDER BY L2.date)
  FROM Log L1
    INNER JOIN Log L2 ON L2.[user] = L1.[user] AND L2.date > L1.date
    INNER JOIN ActionType at ON L2.action = at.action
  WHERE L1.action = @Action
    AND at.type   = @Type
)
SELECT
  [user],
  action,
  ntimes = COUNT(*)
FROM ranked
WHERE rnk = 1
GROUP BY
  [user],
  action
;

Basically, this query selects from the Log table all of the users' records that have the specified action, then joins that subset back to Log to retrieve every action of the specified type that follows, ranking those actions in ascending order of date along the way (using the ROW_NUMBER() function). The query then keeps only the rows ranked 1, groups them by user and action, and counts the rows in each group.

You can see (and play with) a working example at SQL Fiddle.
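For a self-contained check, the same query runs on SQLite 3.25+ (which supports ROW_NUMBER()) once the T-SQL `alias = expression` syntax is rewritten with `AS`; the parameters are bound from Python, and the lookup table is named ActionTypes here to match the question:

```python
import sqlite3

# Sample data from the question, loaded into an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE log (id INTEGER, user TEXT, action TEXT, date TEXT)")
cur.execute("CREATE TABLE ActionTypes (action TEXT, type INTEGER)")
rows = [("mary", "start", "08:00"), ("mary", "open", "09:00"),
        ("john", "start", "09:00"), ("mary", "play", "10:00"),
        ("john", "open", "10:30"), ("mary", "start", "11:00"),
        ("mary", "jump", "12:00"), ("mary", "close", "13:00"),
        ("mary", "delete", "14:00"), ("mary", "start", "15:00"),
        ("mary", "open", "16:00")]
for i, (u, a, t) in enumerate(rows, start=1):
    cur.execute("INSERT INTO log VALUES (?, ?, ?, ?)",
                (i, u, a, "2012-07-16 " + t + ":00"))
cur.executemany("INSERT INTO ActionTypes VALUES (?, ?)",
                [("start", 0), ("open", 1), ("play", 1),
                 ("jump", 2), ("close", 1), ("delete", 1)])

# Rank each type-@Type action that follows a @Action row of the same user;
# rnk = 1 is the first such follower, counted per (user, action).
result = cur.execute("""
    WITH ranked AS (
      SELECT
        L1.[user] AS [user],
        L2.action,
        ROW_NUMBER() OVER (PARTITION BY L1.id ORDER BY L2.date) AS rnk
      FROM log L1
        INNER JOIN log L2 ON L2.[user] = L1.[user] AND L2.date > L1.date
        INNER JOIN ActionTypes at ON L2.action = at.action
      WHERE L1.action = :Action
        AND at.type = :Type
    )
    SELECT [user], action, COUNT(*) AS ntimes
    FROM ranked
    WHERE rnk = 1
    GROUP BY [user], action
""", {"Action": "start", "Type": 1}).fetchall()
print(sorted(result))
# → [('john', 'open', 1), ('mary', 'close', 1), ('mary', 'open', 2)]
```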

Upvotes: 3

Nikola Markovinović

Reputation: 19356

Could you please give this query a try? It uses exists to test whether the previous chronological record is of the requested type. I believe it should be faster than a self-join. I have put a demo @ Sql Fiddle.

select log.[user], log.action, count(*) ntimes
  from log
 inner join actiontype t
    on log.action = t.action
 where t.type = 1
   and exists (select *
                 from 
                   (select top 1 t1.type
                      from log l1
                     inner join actiontype t1
                        on l1.action = t1.action
                     where l1.[user] = log.[user]
                       and l1.date < log.date
                       and t1.type in (0, 1)
                     order by l1.date desc
                   ) prevEntry
                where prevEntry.type = 0
               )
 group by log.[user], log.action

I don't understand why mary \ close is in the result list. The previous record is jump, which is of type 2, and it should not be skipped over to get back to start.
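Rewritten slightly to be runnable on SQLite (`TOP 1` becomes `LIMIT 1`, and the derived table is folded into a scalar subquery, since SQLite may not accept a correlated derived table), this query does reproduce the expected output, including mary \ close; the `t1.type in (0, 1)` filter is exactly what skips over jump:

```python
import sqlite3

# Sample data from the question, loaded into an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE log (id INTEGER, user TEXT, action TEXT, date TEXT)")
cur.execute("CREATE TABLE ActionTypes (action TEXT, type INTEGER)")
rows = [("mary", "start", "08:00"), ("mary", "open", "09:00"),
        ("john", "start", "09:00"), ("mary", "play", "10:00"),
        ("john", "open", "10:30"), ("mary", "start", "11:00"),
        ("mary", "jump", "12:00"), ("mary", "close", "13:00"),
        ("mary", "delete", "14:00"), ("mary", "start", "15:00"),
        ("mary", "open", "16:00")]
for i, (u, a, t) in enumerate(rows, start=1):
    cur.execute("INSERT INTO log VALUES (?, ?, ?, ?)",
                (i, u, a, "2012-07-16 " + t + ":00"))
cur.executemany("INSERT INTO ActionTypes VALUES (?, ?)",
                [("start", 0), ("open", 1), ("play", 1),
                 ("jump", 2), ("close", 1), ("delete", 1)])

# For each type-1 row, keep it only if the latest earlier row of type 0 or 1
# (same user) is of type 0, i.e. the requested 'start' action.
result = cur.execute("""
    SELECT log.user, log.action, COUNT(*) AS ntimes
    FROM log
    INNER JOIN ActionTypes t ON log.action = t.action
    WHERE t.type = 1
      AND 0 = (SELECT t1.type
               FROM log l1
               INNER JOIN ActionTypes t1 ON l1.action = t1.action
               WHERE l1.user = log.user
                 AND l1.date < log.date
                 AND t1.type IN (0, 1)
               ORDER BY l1.date DESC
               LIMIT 1)
    GROUP BY log.user, log.action
""").fetchall()
print(sorted(result))
# → [('john', 'open', 1), ('mary', 'close', 1), ('mary', 'open', 2)]
```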

Upvotes: 3

John Skoubourdis

Reputation: 3259

Your action column, and all of the fields used in relations, will be much faster as integers than as strings.

The only real way to make your queries faster is to change the structure of the database. The relation columns have to be indexed and have to be integers, not strings. For example, something like this:

id   user   action        date
----------------------------------------
1    mary   1   2012-07-16 08:00:00
2    mary   2   2012-07-16 09:00:00
3    john   1   2012-07-16 09:00:00
4    mary   3   2012-07-16 10:00:00
5    john   2   2012-07-16 10:30:00
6    mary   1   2012-07-16 11:00:00
7    mary   4   2012-07-16 12:00:00
8    mary   5   2012-07-16 13:00:00
9    mary   6   2012-07-16 14:00:00
10   mary   1   2012-07-16 15:00:00
11   mary   2   2012-07-16 16:00:00

will solve your problem.

In addition, if you have only 1-9 action types you can make the action column a tinyint. Adding an id with a primary key will definitely make your queries easier (with simple joins) and will also make your database more flexible for future changes. For example, you can have:

id  action  type
------------------
1  start   0
2  open    1
3  play    1
4  jump    2
5  close   1
6  delete  1

Where id is a primary key and the "action" column in the Log table has a foreign key to this id.

I think the main problem is that you don't have indexes and foreign-key relations.
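Concretely, on the sample schema the indexes would cover the correlation columns, roughly as below (shown in SQLite so the sketch is runnable; the index names are made up, and in SQL Server the equivalent CREATE INDEX statements, possibly with INCLUDE (action), would apply). The results are unchanged; only the access path improves:

```python
import sqlite3

# Sample data from the question, loaded into an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE log (id INTEGER, user TEXT, action TEXT, date TEXT)")
cur.execute("CREATE TABLE ActionTypes (action TEXT, type INTEGER)")
rows = [("mary", "start", "08:00"), ("mary", "open", "09:00"),
        ("john", "start", "09:00"), ("mary", "play", "10:00"),
        ("john", "open", "10:30"), ("mary", "start", "11:00"),
        ("mary", "jump", "12:00"), ("mary", "close", "13:00"),
        ("mary", "delete", "14:00"), ("mary", "start", "15:00"),
        ("mary", "open", "16:00")]
for i, (u, a, t) in enumerate(rows, start=1):
    cur.execute("INSERT INTO log VALUES (?, ?, ?, ?)",
                (i, u, a, "2012-07-16 " + t + ":00"))
cur.executemany("INSERT INTO ActionTypes VALUES (?, ?)",
                [("start", 0), ("open", 1), ("play", 1),
                 ("jump", 2), ("close", 1), ("delete", 1)])

# Index the correlation columns: Log(user, date) and ActionTypes(action).
cur.execute("CREATE INDEX ix_log_user_date ON log (user, date)")
cur.execute("CREATE INDEX ix_actiontypes_action ON ActionTypes (action)")

query = """
    SELECT b.user, b.action, COUNT(*) AS ntimes
    FROM log a, log b
    WHERE a.action = 'start' AND b.date > a.date AND a.user = b.user
      AND 1 = (SELECT type FROM ActionTypes WHERE action = b.action)
      AND NOT EXISTS (SELECT c.action FROM log c
                      WHERE c.user = a.user AND c.date > a.date AND c.date < b.date
                        AND 1 = (SELECT type FROM ActionTypes WHERE action = c.action))
    GROUP BY b.user, b.action
"""
# The plan now shows index seeks on (user, date) instead of repeated table scans.
for step in cur.execute("EXPLAIN QUERY PLAN " + query):
    print(step)
result = cur.execute(query).fetchall()
print(sorted(result))
# → [('john', 'open', 1), ('mary', 'close', 1), ('mary', 'open', 2)]
```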

Upvotes: 2

Petro Semeniuk

Reputation: 7038

I slightly disagree with these statements:

  1. ...is much faster to be an integer and not a string

    That's not exactly true: once the action column is indexed, there is little difference between integers and strings.

  2. ...only way to do your queries faster is to change the structure of the database

    In this case the query can be optimized in a number of ways:

    • Avoid filtering on the joined data set (Log × ActionTypes); try to filter earlier (in the example below the filtering happens in an inner subselect).
    • Avoid repeating the same filter condition (WHERE). Even though SQL Server will optimize such queries internally, duplication is usually a sign that you're doing a calculation several times; most of the time you can find a solution in which the condition appears only once (in the example below the WHERE condition comes before GROUP BY).
    • Your best friend is the query analyzer/optimizer output (the execution plan). It's a built-in tool in SQL Server Management Studio and it will show you the query execution cost, taking data volume into account. It's a really good tool and helps to find bottlenecks in queries.

    Here is a simplified query which should produce the result you need (it's written & tested on Oracle, since it's been a while since I last worked with MS SQL Server):

select
  "user",
  action,
  count(*)
from action_log
where action not in ( -- exclusion criteria
    select action_type."action" from action_type where action_type."type" = 1
)
group by "user", action

Upvotes: 0
