Reputation: 1971
I'm writing a website user behaviour analyser tool as a hobby project. It tracks down user link clicks and pages they end up to from those links. It differentiates user sessions with unique UIN identifier within clicks.
I'm writing a milestone and click report from the data but the query is extremely slow. I haven't yet found out a way to increase the performance so that it would run reasonably fast (sub 5s execution time) so if anyone could help me that'd be greatly appreciated.
The part of the query below is very fast. Running time is close to 0.05s:
declare @startDate date = '2013-01-01'
declare @endDate date = '2016-01-14'
declare @user int = 4
declare @country int = 224
select
p.PageId,
p.Name,
-- count of successful page landings
SUM(CASE WHEN m.MileStoneTypeId = 1 AND m.UserId = @user
THEN 1
ELSE 0
END) AS [Successful landings],
-- count of failed page landings
SUM(CASE WHEN m.MileStoneTypeId = 2 AND m.UserId = @user
THEN 1
ELSE 0
END) AS [Failed landings],
-- count of unfinished page landings
SUM(CASE WHEN m.MileStoneTypeId = 3 AND m.UserId = @user
THEN 1
ELSE 0
END) AS [Unfinished landings],
from
Page as p
inner join
Milestone as m
ON p.PageId = m.CampaignId
AND m.UserId = @user
AND m.Created >= @startDate
AND m.Created < @endDate
where
p.PageCountryId = @country
group by
p.PageId,
p.PageName
Here is the full query which performs VERY slowly. The running time is something between 45-60 seconds. The difference is that I'm attempting to gather a count of clicks generated for a specific Page Milestone:
declare @startDate date = '2013-01-01'
declare @endDate date = '2016-01-14'
declare @user int = 4
declare @country int = 224
select
p.PageId,
p.Name,
-- Unique clicks
(SELECT
COUNT(DISTINCT click.UIN)
FROM
Click as click
WHERE
click.PageId = p.PageId AND
click.Created >= @startDate AND
click.Created < @endDate AND
click.UserId = @user
) as [Unique clicks],
-- Total clicks
(SELECT
COUNT(click.UIN)
FROM
Click as click
WHERE
click.PageId = p.PageId AND
click.Created >= @startDate AND
click.Created < @endDate AND
click.User = @user
) as [Total clicks],
-- count of successful page landings
SUM(CASE WHEN m.MileStoneTypeId = 1 AND m.UserId = @user
THEN 1
ELSE 0
END) AS [Successful landings],
-- count of failed page landings
SUM(CASE WHEN m.MileStoneTypeId = 2 AND m.UserId = @user
THEN 1
ELSE 0
END) AS [Failed landings],
-- count of unfinished page landings
SUM(CASE WHEN m.MileStoneTypeId = 3 AND m.UserId = @user
THEN 1
ELSE 0
END) AS [Unfinished landings],
from
Page as p
inner join
Milestone as m
ON p.PageId = m.CampaignId
AND m.UserId = @user
AND m.Created >= @startDate
AND m.Created < @endDate
where
p.PageCountryId = @country
group by
p.PageId,
p.PageName
Executing the click count queries as a standalone queries are reasonably fast. The running time is something close to 1 second for each (DISTINCT and non-distinct) query.
This is "fast" as a standalone query:
-- Unique clicks
(SELECT
COUNT(DISTINCT click.UIN)
FROM
Click as click
WHERE
click.PageId = p.PageId AND
click.Created >= @startDate AND
click.Created < @endDate AND
click.UserId = @user
) as [Unique clicks],
This is also "fast" as a standalone query:
-- Total clicks
(SELECT
COUNT(click.UIN)
FROM
Click as click
WHERE
click.PageId = p.PageId AND
click.Created >= @startDate AND
click.Created < @endDate AND
click.User = @user
) as [Total clicks],
The problem arises when I attempt to combine all in a single large query. For some reason standalone queries run very fast but the combined query execution time is extremely slow.
The table with clicks has a column "UIN" which is assigned for each user when they arrive to the website. When they click a link, a row is inserted to the Click -table with User Id and UIN. The UIN differentiates between user sessions, so UserId 4 with UIN abcdef123 can have multiple identical rows. This UIN is used to calculate unique clicks and total clicks within a user session.
The Page table has approximately 1000 rows. The Milestone table has approximately 200 000 rows and the Click table has approximately 10 000 000 rows.
Any idea how I can improve the performance of the full query with unique and total clicks included?
Here's the table contents and the target output
Data from Page table
+--------+-----------------------+-----------+
| PageId | Name | CountryId |
+--------+-----------------------+-----------+
| 3095 | Registration | 77 |
| 3110 | Customer registration | 77 |
| 5174 | View user details | 77 |
+--------+-----------------------+-----------+
Data from User table
+--------+------+
| UserId | Name |
+--------+------+
| 1 | Dan |
| 2 | Mike |
| 3 | John |
+--------+------+
Data from Clicks table
+---------+--------------------------------------+--------+-------------------------+--------+
| ClickId | Uin | UserId | Created | PageId |
+---------+--------------------------------------+--------+-------------------------+--------+
| 1296600 | B420D0F4-20BE-49BE-AAC9-47DD858B68DD | 4301 | 2016-01-14 12:08:03:723 | 8603 |
| 1296599 | DA5877BA-8FF5-4671-8DF9-CCCBF1555BA1 | 4418 | 2016-01-14 12:07:46:930 | 2009 |
| 1296598 | C6790CB9-6DA6-4A8B-84AA-7D2D3A4B5787 | 4276 | 2016-01-14 12:07:43:563 | 8678 |
+---------+--------------------------------------+--------+-------------------------+--------+
Data from Milestone table
+-------------+-----------------+------------+--------+-------------------------+--------+
| MilestoneId | MilestoneTypeId | CampaignId | UserId | Created | PageId |
+-------------+-----------------+------------+--------+-------------------------+--------+
| 1 | 1 | 1001 | 4 | 2014-02-06 13:18:04:487 | 52 |
| 2 | 1 | 1001 | 4 | 2014-02-06 13:41:01:257 | 9642 |
| 3 | 1 | 1001 | 4 | 2014-02-07 09:52:29:373 | 2393 |
+-------------+-----------------+------------+--------+-------------------------+--------+
Here's the output data I'm trying to achieve:
+---------+-----------------------+---------------+--------------+----------------------+-----------------+---------------------+
| Page Id | Page Name | Unique clicks | Total clicks | Successfull Landings | Failed Landings | Unfinished Landings |
+---------+-----------------------+---------------+--------------+----------------------+-----------------+---------------------+
| 3095 | Registration | 102 | 116 | 2 | 0 | 0 |
| 3110 | Customer registration | 3 | 6 | 1 | 1 | 0 |
| 5174 | View user details | 13 | 13 | 0 | 1 | 0 |
| 5178 | Edit content page | 11 | 11 | 1 | 0 | 0 |
| 6217 | Add new vehicle | 18 | 18 | 2 | 0 | 0 |
+---------+-----------------------+---------------+--------------+----------------------+-----------------+---------------------+
Upvotes: 0
Views: 1885
Reputation: 5040
Clickstream data can be very hard to deal with, often due to the volume of records generated. But in this case I think the problem is due to the use of correlated subqueries in the SELECT clause. If you are not familiar; a correlated subquery is any subquery that refers to the outer query. These harm performance because the SQL engine is forced to evaluate the query once for each row returned. This undermines the set based nature of SQL.
I've made some changes to your sample data. As supplied I couldn't return any records to validate my resultset. I've updated values in the joining fields to address this:
Sample Data
DECLARE @Page TABLE
(
PageId INT,
Name VARCHAR(50),
CountryId INT
)
;
DECLARE @User TABLE
(
UserId INT,
Name VARCHAR(50)
)
;
DECLARE @Clicks TABLE
(
ClickId INT,
Uin UNIQUEIDENTIFIER,
UserId INT,
Created DATETIME,
PageId INT
)
;
DECLARE @Milestone TABLE
(
MiestoneId INT,
MilestoneTypeId INT,
CampaignId INT,
UserId INT,
Created DATETIME,
PageId INT
)
;
INSERT INTO @Page
(
PageId,
Name,
CountryId
)
VALUES
(3095, 'Registration', 77),
(3110, 'Customer registration', 77),
(5174, 'View user details', 77)
;
INSERT INTO @User
(
UserId,
Name
)
VALUES
(4301, 'Dan'),
(2, 'Mike'),
(3, 'John')
;
INSERT INTO @Clicks
(
ClickId,
Uin,
UserId,
Created,
PageId
)
VALUES
(1296600, 'B420D0F4-20BE-49BE-AAC9-47DD858B68DD', 4301, '2016-01-14 12:08:03:723', 3095),
(1296600, 'B420D0F4-20BE-49BE-AAC9-47DD858B68DD', 4301, '2016-01-14 12:08:03:723', 3095),
(1296599, 'DA5877BA-8FF5-4671-8DF9-CCCBF1555BA1', 4301, '2016-01-14 12:07:46:930', 3110),
(1296598, 'C6790CB9-6DA6-4A8B-84AA-7D2D3A4B5787', 4301, '2016-01-14 12:07:43:563', 5174)
;
INSERT INTO @Milestone
(
MiestoneId,
MilestoneTypeId,
CampaignId,
UserId,
Created,
PageId
)
VALUES
(1, 1, 1001, 4301, '2014-01-06 13:18:04:487', 3095),
(2, 1, 1001, 4301, '2014-01-06 13:41:01:257', 3110),
(3, 3, 1001, 4301, '2014-01-07 09:52:29:373', 5174)
;
As you spotted in your original query, you cannot directly join Milestone to Click, as each table has a different grain. In my query I've used CTEs to return the totals from each table. The main body of my query joins the results.
Example
DECLARE @StartDate date = '2013-01-01';
DECLARE @EndDate date = '2016-01-15';
DECLARE @UserId int = 4301;
DECLARE @CountryId int = 77;
WITH Click AS
(
SELECT
UserId,
PageId,
COUNT(DISTINCT Uin) AS [Distinct Clicks],
COUNT(ClickId) AS [Total Clicks]
FROM
@Clicks
WHERE
UserId = @UserId
AND Created BETWEEN @StartDate AND @EndDate
GROUP BY
UserId,
PageId
),
Milestone AS
(
SELECT
UserId,
PageId,
SUM(CASE WHEN MileStoneTypeId = 1 THEN 1 ELSE 0 END) AS [Successful Landings],
SUM(CASE WHEN MileStoneTypeId = 2 THEN 1 ELSE 0 END) AS [Failed Landings],
SUM(CASE WHEN MileStoneTypeId = 3 THEN 1 ELSE 0 END) AS [Unfinished Landings]
FROM
@Milestone
WHERE
UserId = @UserId
AND Created BETWEEN @StartDate AND @EndDate
GROUP BY
UserId,
PageId
)
SELECT
p.PageId,
p.Name,
c.[Distinct Clicks],
c.[Total Clicks],
ms.[Successful Landings],
ms.[Failed Landings],
ms.[Unfinished Landings]
FROM
@Page AS p
INNER JOIN Click AS c ON c.PageId = p.PageId
INNER JOIN Milestone AS ms ON ms.PageId = c.PageId
AND ms.UserId = c.UserId
WHERE
p.CountryId = @CountryId
;
Upvotes: 3
Reputation: 104
You should turn the "clicks" into functions and call those functions through the query. Using "clicks" as subquerys would work slowly because it will run a lot of times for each row.
Upvotes: 0
Reputation: 886
It is slow because you making your "click" selects two times and for each row in your query.
Try to join it as you did with Milestones table and add group by user
clause.
upd. please, can you provide us tables strucure and data like in next example?
declare @Page as table (
PageId int,
etc
)
insert into @page (PageId, etc) values (3095, etc)
Upvotes: 1
Reputation: 552
don't use count(distinct), it will sort then count, the sort really cost you a lot of time. you can just distinct in your table first then count
like this:
select count(1) from (select distinct column from table);
and if you want to check what cost you the most you can use the mode below
set showplan_all on
to check the explan for your query or you can just click the Display Estimated Execution Plan at the Microsoft SQL Server Management Studio
hope this could help you :)
Upvotes: 0