veturi
veturi

Reputation: 1971

Slow SQL query on join and subquery

I'm writing a website user behaviour analyser tool as a hobby project. It tracks down user link clicks and pages they end up to from those links. It differentiates user sessions with unique UIN identifier within clicks.

I'm writing a milestone and click report from the data but the query is extremely slow. I haven't yet found out a way to increase the performance so that it would run reasonably fast (sub 5s execution time) so if anyone could help me that'd be greatly appreciated.

The part of the query below is very fast. Running time is close to 0.05s:

declare @startDate date = '2013-01-01'
declare @endDate date = '2016-01-14'
declare @user int = 4
declare @country int = 224

select
    p.PageId,
    p.Name,

    -- count of successful page landings
    SUM(CASE WHEN m.MileStoneTypeId = 1 AND m.UserId = @user
        THEN 1
        ELSE 0
        END) AS [Successful landings],

    -- count of failed page landings
    SUM(CASE WHEN m.MileStoneTypeId = 2 AND m.UserId = @user
        THEN 1
        ELSE 0
        END) AS [Failed landings],

    -- count of unfinished page landings
    SUM(CASE WHEN m.MileStoneTypeId = 3 AND m.UserId = @user
        THEN 1
        ELSE 0
        END) AS [Unfinished landings],

from
    Page as p
inner join
    Milestone as m
        ON p.PageId = m.CampaignId 
        AND m.UserId = @user
        AND m.Created >= @startDate
        AND m.Created < @endDate
where
    p.PageCountryId = @country
group by
    p.PageId,
    p.PageName

Here is the full query which performs VERY slowly. The running time is something between 45-60 seconds. The difference is that I'm attempting to gather a count of clicks generated for a specific Page Milestone:

declare @startDate date = '2013-01-01'
declare @endDate date = '2016-01-14'
declare @user int = 4
declare @country int = 224

select
    p.PageId,
    p.Name,

    -- Unique clicks
    (SELECT 
        COUNT(DISTINCT click.UIN)
     FROM 
        Click as click 
     WHERE 
        click.PageId = p.PageId AND
        click.Created >= @startDate AND
        click.Created < @endDate AND
        click.UserId = @user
    ) as [Unique clicks],

    -- Total clicks
    (SELECT 
        COUNT(click.UIN)
     FROM 
        Click as click 
     WHERE 
        click.PageId = p.PageId AND
        click.Created >= @startDate AND
        click.Created < @endDate AND
        click.User = @user
     ) as [Total clicks],

    -- count of successful page landings
    SUM(CASE WHEN m.MileStoneTypeId = 1 AND m.UserId = @user
        THEN 1
        ELSE 0
        END) AS [Successful landings],

    -- count of failed page landings
    SUM(CASE WHEN m.MileStoneTypeId = 2 AND m.UserId = @user
        THEN 1
        ELSE 0
        END) AS [Failed landings],

    -- count of unfinished page landings
    SUM(CASE WHEN m.MileStoneTypeId = 3 AND m.UserId = @user
        THEN 1
        ELSE 0
        END) AS [Unfinished landings],

from
    Page as p
inner join
    Milestone as m
        ON p.PageId = m.CampaignId 
        AND m.UserId = @user
        AND m.Created >= @startDate
        AND m.Created < @endDate
where
    p.PageCountryId = @country
group by
    p.PageId,
    p.PageName

Executing the click count queries as a standalone queries are reasonably fast. The running time is something close to 1 second for each (DISTINCT and non-distinct) query.

This is "fast" as a standalone query:

-- Unique clicks
(SELECT 
    COUNT(DISTINCT click.UIN)
 FROM 
    Click as click 
 WHERE 
    click.PageId = p.PageId AND
    click.Created >= @startDate AND
    click.Created < @endDate AND
    click.UserId = @user
) as [Unique clicks],

This is also "fast" as a standalone query:

-- Total clicks
(SELECT 
    COUNT(click.UIN)
 FROM 
    Click as click 
 WHERE 
    click.PageId = p.PageId AND
    click.Created >= @startDate AND
    click.Created < @endDate AND
    click.User = @user
 ) as [Total clicks],

The problem arises when I attempt to combine all in a single large query. For some reason standalone queries run very fast but the combined query execution time is extremely slow.

The table with clicks has a column "UIN" which is assigned for each user when they arrive to the website. When they click a link, a row is inserted to the Click -table with User Id and UIN. The UIN differentiates between user sessions, so UserId 4 with UIN abcdef123 can have multiple identical rows. This UIN is used to calculate unique clicks and total clicks within a user session.

The Page table has approximately 1000 rows. The Milestone table has approximately 200 000 rows and the Click table has approximately 10 000 000 rows.

Any idea how I can improve the performance of the full query with unique and total clicks included?

Here's the table contents and the target output

Data from Page table

+--------+-----------------------+-----------+
| PageId |         Name          | CountryId |
+--------+-----------------------+-----------+
|   3095 | Registration          |        77 |
|   3110 | Customer registration |        77 |
|   5174 | View user details     |        77 |
+--------+-----------------------+-----------+

Data from User table

+--------+------+
| UserId | Name |
+--------+------+
|      1 | Dan  |
|      2 | Mike |
|      3 | John |
+--------+------+

Data from Clicks table

+---------+--------------------------------------+--------+-------------------------+--------+
| ClickId |                 Uin                  | UserId |         Created         | PageId |
+---------+--------------------------------------+--------+-------------------------+--------+
| 1296600 | B420D0F4-20BE-49BE-AAC9-47DD858B68DD |   4301 | 2016-01-14 12:08:03:723 |   8603 |
| 1296599 | DA5877BA-8FF5-4671-8DF9-CCCBF1555BA1 |   4418 | 2016-01-14 12:07:46:930 |   2009 |
| 1296598 | C6790CB9-6DA6-4A8B-84AA-7D2D3A4B5787 |   4276 | 2016-01-14 12:07:43:563 |   8678 |
+---------+--------------------------------------+--------+-------------------------+--------+

Data from Milestone table

+-------------+-----------------+------------+--------+-------------------------+--------+
| MilestoneId | MilestoneTypeId | CampaignId | UserId |         Created         | PageId |
+-------------+-----------------+------------+--------+-------------------------+--------+
|           1 |               1 |       1001 |      4 | 2014-02-06 13:18:04:487 |     52 |
|           2 |               1 |       1001 |      4 | 2014-02-06 13:41:01:257 |   9642 |
|           3 |               1 |       1001 |      4 | 2014-02-07 09:52:29:373 |   2393 |
+-------------+-----------------+------------+--------+-------------------------+--------+

Here's the output data I'm trying to achieve:

+---------+-----------------------+---------------+--------------+----------------------+-----------------+---------------------+
| Page Id |       Page Name       | Unique clicks | Total clicks | Successfull Landings | Failed Landings | Unfinished Landings |
+---------+-----------------------+---------------+--------------+----------------------+-----------------+---------------------+
|    3095 | Registration          |           102 |          116 |                    2 |               0 |                   0 |
|    3110 | Customer registration |             3 |            6 |                    1 |               1 |                   0 |
|    5174 | View user details     |            13 |           13 |                    0 |               1 |                   0 |
|    5178 | Edit content page     |            11 |           11 |                    1 |               0 |                   0 |
|    6217 | Add new vehicle       |            18 |           18 |                    2 |               0 |                   0 |
+---------+-----------------------+---------------+--------------+----------------------+-----------------+---------------------+

Upvotes: 0

Views: 1885

Answers (4)

David Rushton
David Rushton

Reputation: 5040

Clickstream data can be very hard to deal with, often due to the volume of records generated. But in this case I think the problem is due to the use of correlated subqueries in the SELECT clause. If you are not familiar; a correlated subquery is any subquery that refers to the outer query. These harm performance because the SQL engine is forced to evaluate the query once for each row returned. This undermines the set based nature of SQL.

I've made some changes to your sample data. As supplied I couldn't return any records to validate my resultset. I've updated values in the joining fields to address this:

Sample Data

DECLARE @Page TABLE
    (
        PageId        INT,
        Name        VARCHAR(50),
        CountryId    INT
    )
;

DECLARE @User TABLE
    (
        UserId        INT,
        Name        VARCHAR(50)
    )
;

DECLARE @Clicks TABLE
    (
        ClickId        INT,
        Uin            UNIQUEIDENTIFIER,
        UserId        INT,
        Created        DATETIME,
        PageId        INT
    )
;

DECLARE @Milestone TABLE
    (
        MiestoneId        INT,
        MilestoneTypeId    INT,
        CampaignId        INT,
        UserId            INT,
        Created            DATETIME,
        PageId            INT
    )
;




INSERT INTO @Page 
    (
        PageId,
        Name,
        CountryId
    )
VALUES
    (3095, 'Registration', 77),
    (3110, 'Customer registration', 77),
    (5174, 'View user details', 77)
;

INSERT INTO @User 
    (
        UserId,
        Name
    )
VALUES
    (4301, 'Dan'),
    (2, 'Mike'),
    (3, 'John')
;

INSERT INTO @Clicks 
    (
        ClickId,
        Uin,
        UserId,
        Created,
        PageId
    )
VALUES
    (1296600, 'B420D0F4-20BE-49BE-AAC9-47DD858B68DD', 4301, '2016-01-14 12:08:03:723', 3095),
    (1296600, 'B420D0F4-20BE-49BE-AAC9-47DD858B68DD', 4301, '2016-01-14 12:08:03:723', 3095),
    (1296599, 'DA5877BA-8FF5-4671-8DF9-CCCBF1555BA1', 4301, '2016-01-14 12:07:46:930', 3110),
    (1296598, 'C6790CB9-6DA6-4A8B-84AA-7D2D3A4B5787', 4301, '2016-01-14 12:07:43:563', 5174)
;

INSERT INTO @Milestone 
    (
        MiestoneId,
        MilestoneTypeId,
        CampaignId,
        UserId,
        Created,
        PageId
    )
VALUES
    (1, 1, 1001, 4301, '2014-01-06 13:18:04:487', 3095),
    (2, 1, 1001, 4301, '2014-01-06 13:41:01:257', 3110),
    (3, 3, 1001, 4301, '2014-01-07 09:52:29:373', 5174)
;

As you spotted in your original query, you cannot directly join Milestone to Click, as each table has a different grain. In my query I've used CTEs to return the totals from each table. The main body of my query joins the results.

Example

DECLARE @StartDate  date = '2013-01-01';
DECLARE @EndDate    date = '2016-01-15';
DECLARE @UserId     int = 4301;
DECLARE @CountryId  int = 77;


WITH Click AS
    (
        SELECT
            UserId,
            PageId,
            COUNT(DISTINCT Uin)       AS [Distinct Clicks],
            COUNT(ClickId)            AS [Total Clicks]
        FROM
            @Clicks
        WHERE
            UserId = @UserId
            AND Created BETWEEN @StartDate AND @EndDate
        GROUP BY
            UserId,
            PageId
    ),
    Milestone AS
    (
        SELECT
            UserId,
            PageId,
            SUM(CASE WHEN MileStoneTypeId = 1 THEN 1 ELSE 0 END) AS [Successful Landings],
            SUM(CASE WHEN MileStoneTypeId = 2 THEN 1 ELSE 0 END) AS [Failed Landings],
            SUM(CASE WHEN MileStoneTypeId = 3 THEN 1 ELSE 0 END) AS [Unfinished Landings]
        FROM
            @Milestone
        WHERE
            UserId = @UserId
            AND Created BETWEEN @StartDate AND @EndDate
        GROUP BY
            UserId,
            PageId
    )
SELECT
    p.PageId,
    p.Name,
    c.[Distinct Clicks],
    c.[Total Clicks],
    ms.[Successful Landings],
    ms.[Failed Landings],
    ms.[Unfinished Landings]
FROM
    @Page AS p
        INNER JOIN Click AS c            ON  c.PageId    = p.PageId
        INNER JOIN Milestone AS ms       ON  ms.PageId    = c.PageId
                                         AND ms.UserId    = c.UserId
WHERE
    p.CountryId = @CountryId
;

Upvotes: 3

warwreken
warwreken

Reputation: 104

You should turn the "clicks" into functions and call those functions through the query. Using "clicks" as subquerys would work slowly because it will run a lot of times for each row.

Upvotes: 0

Rodion
Rodion

Reputation: 886

It is slow because you making your "click" selects two times and for each row in your query.

Try to join it as you did with Milestones table and add group by user clause.

upd. please, can you provide us tables strucure and data like in next example?

declare @Page as table ( 
  PageId int, 
  etc
)
insert into @page (PageId, etc) values (3095, etc)

Upvotes: 1

Raffaello.D.Huke
Raffaello.D.Huke

Reputation: 552

don't use count(distinct), it will sort then count, the sort really cost you a lot of time. you can just distinct in your table first then count

like this:

select count(1) from (select distinct column from table);

and if you want to check what cost you the most you can use the mode below

     set showplan_all on

to check the explan for your query or you can just click the Display Estimated Execution Plan at the Microsoft SQL Server Management Studio

hope this could help you :)

Upvotes: 0

Related Questions