Reputation: 423
I have a table which is a link table from objects in my SQL Server 2012 database, (annonsid, annonsid2
). This table is used to create chains of triangle or even rectangles to see who can swap with who.
This is the query I use on the table Matching_IDs
which has 1,5 million rows in it, producing 14 million possible chains using this query:
SELECT COUNT(*)
FROM Matching_IDs AS m
INNER JOIN Matching_IDs AS m2
ON m.annonsid2 = m2.annonsid
INNER JOIN Matching_IDs AS m3
ON m2.annonsid2 = m3.annonsid
AND m.annonsid = m3.annonsid2
I must improve performance to take maybe 1 second or less, Is there a faster way to do this? The query takes about 1 minute on my computer. I normally use a WHERE m.annonsid=x
, but it takes just the same amount of time, cause it has to go through all possible combinations anyway.
Update: the latest query plan
|--Compute Scalar(DEFINE:([Expr1006]=CONVERT_IMPLICIT(int,[globalagg1011],0)))
|--Stream Aggregate(DEFINE:([globalagg1011]=SUM([partialagg1010])))
|--Parallelism(Gather Streams)
|--Stream Aggregate(DEFINE:([partialagg1010]=Count(*)))
|--Hash Match(Inner Join, HASH:([m2].[annonsid2], [m2].[annonsid])=([m3].[annonsid], [m].[annonsid2]), RESIDUAL:([MyDatabase].[dbo].[Matching_IDs].[annonsid2] as [m2].[annonsid2]=[MyDatabase].[dbo].[Matching_IDs].[annonsid] as [m3].[annonsid] AND [MyDatabase].[dbo].[Matching_IDs].[annonsid2] as [m].[annonsid2]=[MyDatabase].[dbo].[Matching_IDs].[annonsid] as [m2].[annonsid]))
|--Parallelism(Repartition Streams, Hash Partitioning, PARTITION COLUMNS:([m2].[annonsid2], [m2].[annonsid]))
| |--Index Scan(OBJECT:([MyDatabase].[dbo].[Matching_IDs].[NonClusteredIndex-20121229-133207] AS [m2]))
|--Parallelism(Repartition Streams, Hash Partitioning, PARTITION COLUMNS:([m3].[annonsid], [m].[annonsid2]))
|--Merge Join(Inner Join, MANY-TO-MANY MERGE:([m].[annonsid])=([m3].[annonsid2]), RESIDUAL:([MyDatabase].[dbo].[Matching_IDs].[annonsid] as [m].[annonsid]=[MyDatabase].[dbo].[Matching_IDs].[annonsid2] as [m3].[annonsid2]))
|--Parallelism(Repartition Streams, Hash Partitioning, PARTITION COLUMNS:([m].[annonsid]), ORDER BY:([m].[annonsid] ASC))
| |--Index Scan(OBJECT:([MyDatabase].[dbo].[Matching_IDs].[NonClusteredIndex-20121229-133152] AS [m]), ORDERED FORWARD)
|--Parallelism(Repartition Streams, Hash Partitioning, PARTITION COLUMNS:([m3].[annonsid2]), ORDER BY:([m3].[annonsid2] ASC))
|--Index Scan(OBJECT:([MyDatabase].[dbo].[Matching_IDs].[NonClusteredIndex-20121229-133207] AS [m3]), ORDERED FORWARD)
Upvotes: 8
Views: 11247
Reputation: 15816
You could denormalize the data by adding a table RelatedIds
with AnnonsId
, RelatedAnnonId
and Distance
. For every value of AnnonsId
the table would contains rows for each RelatedAnnonId
and the number of relations that need to be traversed to reach it, aka Distance
. Triggers on the existing MatchingIds
table would maintain the new table with some configured maximum value for Distance
, e.g. 3 to handle rectangular shares. Index the table on (AnnonsId
, Distance
).
Edit: An index on (Distance
, AnnonsId
) will allow you to quickly find rows that have enough related entries to form a particular shape. Adding a column for MaxDistance
may be useful if you want to be able to exclude rows based on, for example, having a triangular but not rectangular relationship.
The new query would inner join RelatedIds as RI on RI.AnnonsId = m.AnnonsId and RI.Distance <= @MaxDistance
with the desired "shape" dictating the value of @MaxDistance
.
It should provide much better performance on the select
. Downsides are another table with a large number of rows and overhead of the triggers when altering the MatchingIds
table.
Example: There are two entries in Matching_IDs
: (1,2) and (2,3).
The new table would contain 3 entries:
1-> 2: distance = 1
1-> 3: distance = 2 (it takes one intermediate 'node' to go from 1 to 3)
2-> 3: distance = 1
adding one more entry to the matching ids (3,1) would result in another entry:
1-> 1: distance = 3
And voilá: You found a triangle (distance=3).
Now, to find all triangles, simply do this:
select *
from RelatedIds
where AnnonsId=RelatedAnnonId
and Distance=3
Upvotes: 0
Reputation: 688
1 - change select Count(*)
to Count(1)
or Count(id)
2 - Write set Nocount on
at the first of your Stored procedure or at the first of your Query
3 - Use index on annonsid
, annonsid2
4 - Having your Indexes after the primary key in your Table
Upvotes: 0
Reputation: 4622
Some ideas:
Try two indexes (annonsid,annonsid2) and (annonsid2,annonsid)
Have you tried a column store index? It makes the table read only but it might improve performance.
Also, some variations of the query could help. Examples:
SELECT COUNT(*)
FROM Matching_IDs AS m
INNER JOIN Matching_IDs AS m2
ON m.annonsid2 = m2.annonsid
INNER JOIN Matching_IDs AS m3
ON m2.annonsid2 = m3.annonsid
where m.annonsid = m3.annonsid2
or
SELECT COUNT(*)
FROM Matching_IDs AS m, Matching_IDs AS m2, Matching_IDs AS m3
where m2.annonsid2 = m3.annonsid
and m.annonsid2 = m2.annonsid
and m.annonsid = m3.annonsid2
Did you check the CPU/IO-Load? If IO-Load is high, then the server is not crunching numbers but swapping => more RAM solves the problem.
How fast is this query?
SELECT COUNT(*)
FROM Matching_IDs AS m
INNER JOIN Matching_IDs AS m2
ON m.annonsid2 = m2.annonsid
If this is very fast but adding the next join slows thing down then you propably need more RAM.
Upvotes: 3
Reputation: 7695
You should make this query a bit more separated. I think first you should create a table, where you can store the primary key + annonsid, annonsid2 -if annosid is not the primary key itself.
DECLARE @AnnonsIds TABLE
(
primaryKey int,
-- if you need later more info from the original rows like captions
-- AND it is not (just) the annonsid
annonsid int,
annonsid2 int
)
If you declare a table, and you have index on this column, it is quite fast to get the specified rows by the WHERE annonsid = @annonsid OR annonsid2 = @annosid
After the first step you have a much smaller (I guess), and "thin" table to work with. Then you just have to use the joins here or make a temp table and a CTE on it.
I think it should be faster, depending on the selectivity of the condition in the WHERE
, of yourse if 1.1 million rows fits in it, then it does not make sense, but if just a couple hundred or tousend, then you should give it a try!
Upvotes: 1
Reputation: 171178
It seems like you already indexed this quite well. You can try converting the hash to a merge join by adding the right multi-column index, but it won't give you the desired speedup of 60x.
I think this index would be on annonsid, annonsid2
although I might have made a mistake here.
It would be nice to materialize all of this but indexed views do not support self-joins. You can try to materialize this query (unaggregated) into a new table. Whenever you execute DML against the base table, also update the second table (using either application logic or triggers). That would allow you to query blazingly fast.
Upvotes: 1