infinity1975
infinity1975

Reputation: 423

Optimize self join on millions of rows

I have a table which is a link table from objects in my SQL Server 2012 database, (annonsid, annonsid2). This table is used to create chains of triangle or even rectangles to see who can swap with who.

This is the query I use on the table Matching_IDs which has 1,5 million rows in it, producing 14 million possible chains using this query:

SELECT COUNT(*)
FROM Matching_IDs AS m
  INNER JOIN Matching_IDs AS m2
     ON m.annonsid2 = m2.annonsid
  INNER JOIN Matching_IDs AS m3
     ON m2.annonsid2 = m3.annonsid
       AND m.annonsid = m3.annonsid2

I must improve performance to take maybe 1 second or less, Is there a faster way to do this? The query takes about 1 minute on my computer. I normally use a WHERE m.annonsid=x, but it takes just the same amount of time, cause it has to go through all possible combinations anyway.

Update: the latest query plan

|--Compute Scalar(DEFINE:([Expr1006]=CONVERT_IMPLICIT(int,[globalagg1011],0)))
   |--Stream Aggregate(DEFINE:([globalagg1011]=SUM([partialagg1010])))
        |--Parallelism(Gather Streams)
             |--Stream Aggregate(DEFINE:([partialagg1010]=Count(*)))
                  |--Hash Match(Inner Join, HASH:([m2].[annonsid2], [m2].[annonsid])=([m3].[annonsid], [m].[annonsid2]), RESIDUAL:([MyDatabase].[dbo].[Matching_IDs].[annonsid2] as [m2].[annonsid2]=[MyDatabase].[dbo].[Matching_IDs].[annonsid] as [m3].[annonsid] AND [MyDatabase].[dbo].[Matching_IDs].[annonsid2] as [m].[annonsid2]=[MyDatabase].[dbo].[Matching_IDs].[annonsid] as [m2].[annonsid]))
                       |--Parallelism(Repartition Streams, Hash Partitioning, PARTITION COLUMNS:([m2].[annonsid2], [m2].[annonsid]))
                       |    |--Index Scan(OBJECT:([MyDatabase].[dbo].[Matching_IDs].[NonClusteredIndex-20121229-133207] AS [m2]))
                       |--Parallelism(Repartition Streams, Hash Partitioning, PARTITION COLUMNS:([m3].[annonsid], [m].[annonsid2]))
                            |--Merge Join(Inner Join, MANY-TO-MANY MERGE:([m].[annonsid])=([m3].[annonsid2]), RESIDUAL:([MyDatabase].[dbo].[Matching_IDs].[annonsid] as [m].[annonsid]=[MyDatabase].[dbo].[Matching_IDs].[annonsid2] as [m3].[annonsid2]))
                                 |--Parallelism(Repartition Streams, Hash Partitioning, PARTITION COLUMNS:([m].[annonsid]), ORDER BY:([m].[annonsid] ASC))
                                 |    |--Index Scan(OBJECT:([MyDatabase].[dbo].[Matching_IDs].[NonClusteredIndex-20121229-133152] AS [m]), ORDERED FORWARD)
                                 |--Parallelism(Repartition Streams, Hash Partitioning, PARTITION COLUMNS:([m3].[annonsid2]), ORDER BY:([m3].[annonsid2] ASC))
                                      |--Index Scan(OBJECT:([MyDatabase].[dbo].[Matching_IDs].[NonClusteredIndex-20121229-133207] AS [m3]), ORDERED FORWARD)

Upvotes: 8

Views: 11247

Answers (5)

HABO
HABO

Reputation: 15816

You could denormalize the data by adding a table RelatedIds with AnnonsId, RelatedAnnonId and Distance. For every value of AnnonsId the table would contains rows for each RelatedAnnonId and the number of relations that need to be traversed to reach it, aka Distance. Triggers on the existing MatchingIds table would maintain the new table with some configured maximum value for Distance, e.g. 3 to handle rectangular shares. Index the table on (AnnonsId, Distance).

Edit: An index on (Distance, AnnonsId) will allow you to quickly find rows that have enough related entries to form a particular shape. Adding a column for MaxDistance may be useful if you want to be able to exclude rows based on, for example, having a triangular but not rectangular relationship.

The new query would inner join RelatedIds as RI on RI.AnnonsId = m.AnnonsId and RI.Distance <= @MaxDistance with the desired "shape" dictating the value of @MaxDistance.

It should provide much better performance on the select. Downsides are another table with a large number of rows and overhead of the triggers when altering the MatchingIds table.

Example: There are two entries in Matching_IDs: (1,2) and (2,3). The new table would contain 3 entries:
1-> 2: distance = 1
1-> 3: distance = 2 (it takes one intermediate 'node' to go from 1 to 3)
2-> 3: distance = 1

adding one more entry to the matching ids (3,1) would result in another entry:
1-> 1: distance = 3

And voilá: You found a triangle (distance=3).

Now, to find all triangles, simply do this:

select * 
  from RelatedIds 
 where AnnonsId=RelatedAnnonId 
   and Distance=3

Upvotes: 0

Alireza Masali
Alireza Masali

Reputation: 688

1 - change select Count(*) to Count(1) or Count(id)

2 - Write set Nocount on at the first of your Stored procedure or at the first of your Query

3 - Use index on annonsid , annonsid2

4 - Having your Indexes after the primary key in your Table

Upvotes: 0

alzaimar
alzaimar

Reputation: 4622

Some ideas:

Try two indexes (annonsid,annonsid2) and (annonsid2,annonsid)

Have you tried a column store index? It makes the table read only but it might improve performance.

Also, some variations of the query could help. Examples:

SELECT COUNT(*)
FROM Matching_IDs AS m
  INNER JOIN Matching_IDs AS m2
     ON m.annonsid2 = m2.annonsid
  INNER JOIN Matching_IDs AS m3
     ON m2.annonsid2 = m3.annonsid
where m.annonsid = m3.annonsid2

or

SELECT COUNT(*)
FROM Matching_IDs AS m, Matching_IDs AS m2, Matching_IDs AS m3
where m2.annonsid2 = m3.annonsid
  and m.annonsid2 = m2.annonsid
  and m.annonsid = m3.annonsid2

Did you check the CPU/IO-Load? If IO-Load is high, then the server is not crunching numbers but swapping => more RAM solves the problem.

How fast is this query?

SELECT COUNT(*)
FROM Matching_IDs AS m
  INNER JOIN Matching_IDs AS m2
     ON m.annonsid2 = m2.annonsid

If this is very fast but adding the next join slows thing down then you propably need more RAM.

Upvotes: 3

Andr&#225;s Ott&#243;
Andr&#225;s Ott&#243;

Reputation: 7695

You should make this query a bit more separated. I think first you should create a table, where you can store the primary key + annonsid, annonsid2 -if annosid is not the primary key itself.

DECLARE @AnnonsIds TABLE
(
primaryKey int,
-- if you need later more info from the original rows like captions  
-- AND it is not (just) the annonsid
annonsid int,
annonsid2 int
)

If you declare a table, and you have index on this column, it is quite fast to get the specified rows by the WHERE annonsid = @annonsid OR annonsid2 = @annosid

After the first step you have a much smaller (I guess), and "thin" table to work with. Then you just have to use the joins here or make a temp table and a CTE on it.

I think it should be faster, depending on the selectivity of the condition in the WHERE, of yourse if 1.1 million rows fits in it, then it does not make sense, but if just a couple hundred or tousend, then you should give it a try!

Upvotes: 1

usr
usr

Reputation: 171178

It seems like you already indexed this quite well. You can try converting the hash to a merge join by adding the right multi-column index, but it won't give you the desired speedup of 60x.

I think this index would be on annonsid, annonsid2 although I might have made a mistake here.

It would be nice to materialize all of this but indexed views do not support self-joins. You can try to materialize this query (unaggregated) into a new table. Whenever you execute DML against the base table, also update the second table (using either application logic or triggers). That would allow you to query blazingly fast.

Upvotes: 1

Related Questions