OverflowingTheGlass

Reputation: 2434

SQL Server 2012 Estimated Row Numbers Much Different than Actual

I have a query that cross joins two tables. TABLE_1 has 15,000 rows and TABLE_2 has 50,000 rows. A very similar query has run in roughly 10 minutes in the past. Now it runs indefinitely under the same server conditions (i.e. nothing else running), and the very similar query also runs indefinitely.

SELECT A.KEY_1
      ,A.FULL_TEXT_1
      ,B.FULL_TEXT_2
      ,B.KEY_2
      ,MDS_DB.MDQ.SIMILARITY(A.FULL_TEXT_1,B.FULL_TEXT_2, 2, 0, 0) AS confidence
FROM #TABLE_1 A
CROSS JOIN #TABLE_2 B
WHERE MDS_DB.MDQ.SIMILARITY(A.FULL_TEXT_1,B.FULL_TEXT_2, 2, 0, 0) >= 0.9

When I run the estimated execution plan for this query, the Nested Loops (Inner Join) node is estimated at 96% of the execution. The estimated number of rows is 218 million, even though cross joining the tables should produce 15,000 * 50,000 = 750 million rows. When I add INSERT INTO #temp_table to the beginning of the query, the estimated execution plan puts Insert Into at 97% and still estimates 218 million rows. In reality, there should be fewer than 100 matches with a similarity score above 0.9.

I have read that large differences in estimated vs. actual row counts can impact performance. What could I do to test/fix this?
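One way to test this is to refresh the statistics on the temp tables and then compare the optimizer's per-operator estimates against actual row counts. A hedged sketch (the SELECT list here is trimmed to the key columns for brevity):

```sql
-- Temp tables carry their own statistics; refreshing them can correct
-- stale estimates before the run.
UPDATE STATISTICS #TABLE_1;
UPDATE STATISTICS #TABLE_2;

-- STATISTICS PROFILE returns one row per plan operator, including both
-- EstimateRows and the actual Rows, so large gaps are easy to spot.
SET STATISTICS PROFILE ON;

SELECT A.KEY_1
      ,B.KEY_2
FROM #TABLE_1 A
CROSS JOIN #TABLE_2 B
WHERE MDS_DB.MDQ.SIMILARITY(A.FULL_TEXT_1, B.FULL_TEXT_2, 2, 0, 0) >= 0.9
OPTION (RECOMPILE);  -- force a fresh plan from the current statistics

SET STATISTICS PROFILE OFF;
```

Capturing the actual execution plan in SSMS ("Include Actual Execution Plan") shows the same estimated-vs-actual comparison graphically.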

Upvotes: 0

Views: 119

Answers (3)

juvian

Reputation: 16068

For better performance, use the minScoreHint parameter. It lets the function skip the full similarity calculation and exit early for pairs that cannot reach the threshold.

So this should run quicker:

SELECT A.KEY_1
      ,A.FULL_TEXT_1
      ,B.FULL_TEXT_2
      ,B.KEY_2
      ,MDS_DB.MDQ.SIMILARITY(A.FULL_TEXT_1,B.FULL_TEXT_2, 2, 0, 0, 0.9) AS confidence
FROM #TABLE_1 A
CROSS JOIN #TABLE_2 B
WHERE MDS_DB.MDQ.SIMILARITY(A.FULL_TEXT_1,B.FULL_TEXT_2, 2, 0, 0, 0.9) >= 0.9

It is not clear from the docs whether results scoring exactly 0.9 would be included. If not, lower the hint to 0.89.
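To stay safe at the boundary, one variant keeps the WHERE filter at 0.9 and lowers only the hint, so exact-0.9 matches cannot be pruned by the early exit:

```sql
-- minScoreHint at 0.89 only prunes pairs that cannot reach 0.89;
-- the WHERE clause still enforces the real 0.9 threshold.
SELECT A.KEY_1
      ,A.FULL_TEXT_1
      ,B.FULL_TEXT_2
      ,B.KEY_2
      ,MDS_DB.MDQ.SIMILARITY(A.FULL_TEXT_1, B.FULL_TEXT_2, 2, 0, 0, 0.89) AS confidence
FROM #TABLE_1 A
CROSS JOIN #TABLE_2 B
WHERE MDS_DB.MDQ.SIMILARITY(A.FULL_TEXT_1, B.FULL_TEXT_2, 2, 0, 0, 0.89) >= 0.9
```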

Upvotes: 1

FrugalShaun

Reputation: 176

The link provided by scsimon will help you prove whether or not it's statistics. Have the estimates changed significantly compared to when it was running fast?

Parallelism springs to mind. If the query used to go parallel but now doesn't (e.g. because a server setting or the statistics changed), that could cause significant performance degradation.
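A quick sketch for checking this theory: inspect the server settings that govern parallelism, then test the query with an explicit degree-of-parallelism hint (the MAXDOP value of 8 here is just an example; match it to your core count):

```sql
-- Server-level settings that decide whether a plan can go parallel.
SELECT name, value_in_use
FROM sys.configurations
WHERE name IN ('max degree of parallelism',
               'cost threshold for parallelism');

-- Re-run the query with a MAXDOP hint; if it speeds back up, the lost
-- parallelism was likely the cause.
SELECT A.KEY_1
      ,B.KEY_2
FROM #TABLE_1 A
CROSS JOIN #TABLE_2 B
WHERE MDS_DB.MDQ.SIMILARITY(A.FULL_TEXT_1, B.FULL_TEXT_2, 2, 0, 0) >= 0.9
OPTION (MAXDOP 8);
```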

Upvotes: 0

Gordon Linoff

Reputation: 1269763

I have read that large differences in estimated vs. actual row counts can impact performance. What could I do to test/fix this?

Yes, this is true. It particularly affects optimizations involving join algorithms, aggregation algorithms, and indexes.

But it does not apply to your query. Your query has to do a nested loops join with no indexes: every pair of values in the two tables must be compared. There is little algorithmic flexibility, and (standard) indexes cannot really help.

Upvotes: 1
