Reputation: 24498
I know, I know. It's not supposed to be this way.
Big picture, I'm working with map data and trying to determine which street a bus stop is closest to. A street is made up of a series of points. Some streets have 2 points, but most have about 1/2 a dozen.
My points table looks like this:
CREATE TABLE [dbo].[MapStreetsPoints](
[FeatureId] [int] NOT NULL,
[PointNumber] [int] NOT NULL,
[Latitude] [decimal](8, 5) NOT NULL,
[Longitude] [decimal](8, 5) NOT NULL,
CONSTRAINT [PK_MapStreetsPoints] PRIMARY KEY CLUSTERED
(
[FeatureId] ASC,
[PointNumber] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 90) ON [PRIMARY]
) ON [PRIMARY]
Sample data from MapStreetsPoints
FeatureId PointNumber Latitude Longitude
----------- ----------- ---------- ----------
1 1 39.81396 -75.83017
1 2 39.81392 -75.83019
1 3 39.81387 -75.83018
1 4 39.81344 -75.83003
1 5 39.81339 -75.83000
1 6 39.81336 -75.82996
1 7 39.81333 -75.82990
1 8 39.81332 -75.82983
1 9 39.81332 -75.82977
1 10 39.81335 -75.82972
2 1 39.72170 -76.11486
2 2 39.72209 -76.11482
2 3 39.72248 -76.11474
2 4 39.72279 -76.11457
2 5 39.72362 -76.11404
2 6 39.72418 -76.11364
2 7 39.72482 -76.11321
2 8 39.72526 -76.11296
2 9 39.72560 -76.11282
2 10 39.72597 -76.11275
2 11 39.72644 -76.11274
There are about 2 million rows in this table.
Basically, I am filtering the rows to something manageable, self-joining the table to itself, applying a slow order by and take the top 1 row. I do this to identify the feature closest to a point.
My inline table value function looks like this.
ALTER FUNCTION dbo.fnMapReverseGeocodeNearestStreet
(
@Longitude Decimal(8,5),
@Latitude Decimal(8,5)
)
RETURNS TABLE
AS
RETURN
(
Select Top 1
A.FeatureId
From MapStreetsPoints As A
Inner Join MapStreetsPoints As B
On A.FeatureId = B.FeatureId
And A.PointNumber = B.PointNumber - 1
Where A.Longitude Between Convert(Decimal(8,5), @Longitude - 0.002)
and Convert(Decimal(8,5), @Longitude + 0.002)
And A.Latitude Between Convert(Decimal(8,5), @Latitude - 0.002)
And Convert(Decimal(8,5), @Latitude + 0.002)
And B.Longitude Between Convert(Decimal(8,5), @Longitude - 0.002)
and Convert(Decimal(8,5), @Longitude + 0.002)
And B.Latitude Between Convert(Decimal(8,5), @Latitude - 0.002)
And Convert(Decimal(8,5), @Latitude + 0.002)
Order By dbo.PerpendicularDistanceToLine(@Longitude, @Latitude,
A.Longitude,
A.Latitude,
B.Longitude,
B.Latitude)
)
The multi-step table valued function looks like this.
Alter FUNCTION dbo.fnMapReverseGeocodeNearestStreet_2
(
@Longitude Decimal(8,5),
@Latitude Decimal(8,5)
)
RETURNS @Output TABLE (FeatureId Int)
AS
BEGIN
-- Fill the table variable with the rows for your result set
Declare @Temp
Table (
FeatureId Int,
StartLongitude Decimal(8,5),
StartLatitude Decimal(8,5),
EndLongitude Decimal(8,5),
EndLatitude Decimal(8,5)
);
Insert
Into @Temp(FeatureId, StartLongitude, StartLatitude, EndLongitude, EndLatitude)
Select A.FeatureId,
A.Longitude As StartLongitude,
A.Latitude As StartLatitude,
B.Longitude As EndLongitude,
B.Latitude As EndLatitude
From MapStreetsPoints As A
Inner Join MapStreetsPoints As B
On A.FeatureId = B.FeatureId
And A.PointNumber = B.PointNumber - 1
Where A.Longitude Between Convert(Decimal(8,5), @Longitude - 0.002)
and Convert(Decimal(8,5), @Longitude + 0.002)
And A.Latitude Between Convert(Decimal(8,5), @Latitude - 0.002)
And Convert(Decimal(8,5), @Latitude + 0.002)
And B.Longitude Between Convert(Decimal(8,5), @Longitude - 0.002)
and Convert(Decimal(8,5), @Longitude + 0.002)
And B.Latitude Between Convert(Decimal(8,5), @Latitude - 0.002)
And Convert(Decimal(8,5), @Latitude + 0.002);
Insert
Into @Output(FeatureId)
Select Top 1 FeatureId
From @Temp T
Order By dbo.PerpendicularDistanceToLine(@Longitude, @Latitude, T.StartLongitude, T.StartLatitude, T.EndLongitude, T.EndLatitude);
RETURN
END
The logic is the same. The difference is that I load a table variable with intermediate results prior to applying the slow-ish order by. It seems to me like the order by is being applied prior to filtering.
To test this, I am using the following query:
Select Map.FeatureId,
BusStop.BusStopId1,
BusStop.Description,
MapStreets.Hazardous
From BusStop
Cross Apply dbo.fnMapReverseGeocodeNearestStreet(BusStop.XCoord,BusStop.YCoord) As Map
Inner Join MapStreets
On Map.FeatureId = MapStreets.FeatureId
There's approximately 2,000 rows in the BusStop table.
When I run the test code for the inline table valued function, it takes 22 seconds. With the multi-step table value function, it takes 17 seconds.
The show plan for the itvf:
|--Nested Loops(Inner Join, OUTER REFERENCES:([A].[FeatureId], [Expr1008]) WITH UNORDERED PREFETCH)
|--Nested Loops(Inner Join, OUTER REFERENCES:([**DatabaseName**].[dbo].[BusStop].[XCoord], [**DatabaseName**].[dbo].[BusStop].[YCoord]))
| |--Index Scan(OBJECT:([**DatabaseName**].[dbo].[BusStop].[idx_BusStop_Description]))
| |--Index Spool(SEEK:([**DatabaseName**].[dbo].[BusStop].[XCoord]=[**DatabaseName**].[dbo].[BusStop].[XCoord] AND [**DatabaseName**].[dbo].[BusStop].[YCoord]=[**DatabaseName**].[dbo].[BusStop].[YCoord]))
| |--Sort(TOP 1, ORDER BY:([Expr1004] ASC))
| |--Compute Scalar(DEFINE:([Expr1004]=[**DatabaseName**].[dbo].[PerpendicularDistanceToLine]([**DatabaseName**].[dbo].[BusStop].[XCoord],[**DatabaseName**].[dbo].[BusStop].[YCoord],[**DatabaseName**].[dbo].[MapStreetsPoints].[Longitude] as [A].[Longitude],[**DatabaseName**].[dbo].[MapStreetsPoints].[Latitude] as [A].[Latitude],[**DatabaseName**].[dbo].[MapStreetsPoints].[Longitude] as [B].[Longitude],[**DatabaseName**].[dbo].[MapStreetsPoints].[Latitude] as [B].[Latitude])))
| |--Hash Match(Inner Join, HASH:([A].[FeatureId], [A].[PointNumber])=([B].[FeatureId], [Expr1007]), RESIDUAL:([**DatabaseName**].[dbo].[MapStreetsPoints].[FeatureId] as [A].[FeatureId]=[**DatabaseName**].[dbo].[MapStreetsPoints].[FeatureId] as [B].[FeatureId] AND [**DatabaseName**].[dbo].[MapStreetsPoints].[PointNumber] as [A].[PointNumber]=[Expr1007]))
| |--Index Seek(OBJECT:([**DatabaseName**].[dbo].[MapStreetsPoints].[MapStreetsPoints4] AS [A]), SEEK:([A].[Longitude] >= CONVERT(decimal(8,5),[**DatabaseName**].[dbo].[BusStop].[XCoord]-(0.002),0) AND [A].[Longitude] <= CONVERT(decimal(8,5),[**DatabaseName**].[dbo].[BusStop].[XCoord]+(0.002),0)), WHERE:([**DatabaseName**].[dbo].[MapStreetsPoints].[Latitude] as [A].[Latitude]>=CONVERT(decimal(8,5),[**DatabaseName**].[dbo].[BusStop].[YCoord]-(0.002),0) AND [**DatabaseName**].[dbo].[MapStreetsPoints].[Latitude] as [A].[Latitude]<=CONVERT(decimal(8,5),[**DatabaseName**].[dbo].[BusStop].[YCoord]+(0.002),0)) ORDERED FORWARD)
| |--Compute Scalar(DEFINE:([Expr1007]=[**DatabaseName**].[dbo].[MapStreetsPoints].[PointNumber] as [B].[PointNumber]-(1)))
| |--Index Seek(OBJECT:([**DatabaseName**].[dbo].[MapStreetsPoints].[MapStreetsPoints4] AS [B]), SEEK:([B].[Longitude] >= CONVERT(decimal(8,5),[**DatabaseName**].[dbo].[BusStop].[XCoord]-(0.002),0) AND [B].[Longitude] <= CONVERT(decimal(8,5),[**DatabaseName**].[dbo].[BusStop].[XCoord]+(0.002),0)), WHERE:([**DatabaseName**].[dbo].[MapStreetsPoints].[Latitude] as [B].[Latitude]>=CONVERT(decimal(8,5),[**DatabaseName**].[dbo].[BusStop].[YCoord]-(0.002),0) AND [**DatabaseName**].[dbo].[MapStreetsPoints].[Latitude] as [B].[Latitude]<=CONVERT(decimal(8,5),[**DatabaseName**].[dbo].[BusStop].[YCoord]+(0.002),0)) ORDERED FORWARD)
|--Clustered Index Seek(OBJECT:([**DatabaseName**].[dbo].[MapStreets].[PK_MapStreets]), SEEK:([**DatabaseName**].[dbo].[MapStreets].[FeatureId]=[**DatabaseName**].[dbo].[MapStreetsPoints].[FeatureId] as [A].[FeatureId]) ORDERED FORWARD)
XML Version of showplan for multi-step version
The Show Plan for the multi step version is:
|--Table Insert(OBJECT:(@Temp), SET:([FeatureId] = [**DatabaseName**].[dbo].[MapStreetsPoints].[FeatureId] as [A].[FeatureId],[StartLongitude] = [**DatabaseName**].[dbo].[MapStreetsPoints].[Longitude] as [A].[Longitude],[StartLatitude] = [UDSD 2021
|--Top(ROWCOUNT est 0)
|--Nested Loops(Inner Join, OUTER REFERENCES:([B].[FeatureId], [Expr1013]))
|--Compute Scalar(DEFINE:([Expr1013]=[**DatabaseName**].[dbo].[MapStreetsPoints].[PointNumber] as [B].[PointNumber]-(1)))
| |--Index Seek(OBJECT:([**DatabaseName**].[dbo].[MapStreetsPoints].[MapStreetsPoints4] AS [B]), SEEK:([B].[Longitude] >= CONVERT(decimal(8,5),[@Longitude]-(0.002),0) AND [B].[Longitude] <= CONVERT(decimal(8,5),[@Longitude]+(0.0
|--Index Seek(OBJECT:([**DatabaseName**].[dbo].[MapStreetsPoints].[MapStreetsPoints_PointNumber] AS [A]), SEEK:([A].[PointNumber]=[Expr1013] AND [A].[FeatureId]=[**DatabaseName**].[dbo].[MapStreetsPoints].[FeatureId] as [B].[FeatureI
|--Table Insert(OBJECT:([**DatabaseName**].[dbo].[fnMapReverseGeocodeNearestStreet_2]), SET:([FeatureId] = @Temp.[FeatureId] as [T].[FeatureId]))
|--Sort(TOP 1, ORDER BY:([Expr1005] ASC))
|--Compute Scalar(DEFINE:([Expr1005]=[**DatabaseName**].[dbo].[PerpendicularDistanceToLine]([@Longitude],[@Latitude],@Temp.[StartLongitude] as [T].[StartLongitude],@Temp.[StartLatitude] as [T].[StartLatitude],@Temp.[EndLongitude] as [T]
|--Table Scan(OBJECT:(@Temp AS [T]))
I can't figure out why the inline table value function is slower. My best guess is that the order by is being applied too soon and is therefore running on too many rows. I should mention that after the filtering is applied, there is usually less than 100 rows for the order by to look at.
PerpendicularDistanceToLine is a multi-statement UDF that returns a scalar. It has no table access, it simply applies a series of math operations on the input.
Upvotes: 4
Views: 608
Reputation: 453746
TLDR;
The function has a predicate on
Longitude BETWEEN XCoord-0.002 AND XCoord+0.002
AND Latitude BETWEEN YCoord-0.002 AND YCoord+0.002`
The XCoord
and YCoord
are passed in from the outer table (BusStop).
For the inline table valued function the plan references these outer columns directly and SQL Server just guesses the cardinality returned based on a hardcoded formula as 6,557.1
per execution and uses a hash join.
For the multistatement TVF these values are passed in as parameter values. This allows it to sniff the values passed in when it is actually first executed and compile a different plan.
The different plan will still only be based on one set of parameter values (not optimised separately for each outer row) but potentially the different plan may be better enough overall in your case that it more than mitigates the overhead of the multistatement TVF itself.
Multistatement TVF plan (just function part)
My best guess is that the order by is being applied too soon and is therefore running on too many rows.
No this isn't the case. Both the execution plans have the same basic strategy.
There are other differences too of course.
The inline table valued function version spools the results from step b in case it sees the same correlated parameter values (but presumably every bus stop is in a unique location so this never happens in practice).
The multi statement TVF plan has separate sub trees for 1a and 1b and they are inserted into two different table variables (The TOP
operator in 1a
here is a "rowcount top" in case the execution plan is used when SET ROWCOUNT
is set and does not filter the rows down at all).
Nonetheless the number of Scalar UDF invocations and rows going into the TOP N sort should be the same in both cases as this step is running on the result of step 1a
What is different is the join strategies used in step 1a itself.
Inline TVF plan
A.FeatureId = B.FeatureId And A.PointNumber = B.PointNumber - 1
The seeks above are using a BETWEEN
with values passed in from the outer row. SQL Server guesses each execution will return 16,557.1 rows. I'm not sure what the exact formula is but this estimate is entirely a guess based on table cardinality. I was able to get the exact same guess when I created a table my side and inserted 2,044,080 rows and used the legacy cardinality estimator.
MultiStatement TVF plan
PointNumber, FeatureId
and then a range on the Latitude, Longitude
portion.The initial seek above is identical to the one in the hash join case. The only difference is whether or not the predicate is expressed as an outer table reference or function parameter but the estimated rows are radically different.
The plan you provided was an estimated plan and was compiled with null parameter values. You should gather the actual plan and check whether or not it is the same.
On the assumption that the plans are the same then the conclusion would be that nested loops is in practice a much better join choice for your example data than hash join. And better enough to overcome the extra baggage for the multistatement TVF.
You should try the following to get the same nested loops plan but without the multistatement baggage and see how that performs.
CREATE FUNCTION dbo.fnMapReverseGeocodeNearestStreet_i2
(
@Longitude Decimal(8,5),
@Latitude Decimal(8,5)
)
RETURNS TABLE
AS
RETURN
(
Select Top 1
A.FeatureId
From MapStreetsPoints As A
Inner LOOP Join MapStreetsPoints As B with (index = [MapStreetsPoints_PointNumber])
On A.FeatureId = B.FeatureId
And A.PointNumber + 1 = B.PointNumber
Where A.Longitude Between Convert(Decimal(8,5), @Longitude - 0.002)
and Convert(Decimal(8,5), @Longitude + 0.002)
And A.Latitude Between Convert(Decimal(8,5), @Latitude - 0.002)
And Convert(Decimal(8,5), @Latitude + 0.002)
And B.Longitude Between Convert(Decimal(8,5), @Longitude - 0.002)
and Convert(Decimal(8,5), @Longitude + 0.002)
And B.Latitude Between Convert(Decimal(8,5), @Latitude - 0.002)
And Convert(Decimal(8,5), @Latitude + 0.002)
Order By dbo.PerpendicularDistanceToLine(@Longitude, @Latitude,
A.Longitude,
A.Latitude,
B.Longitude,
B.Latitude)
)
And you should certainly get rid of the scalar UDF and replace it with an inline TVF to get rid of the overhead associated with non inline scalar UDFs
Upvotes: 2