George Mastros
George Mastros

Reputation: 24498

Multi step table value function faster than inline table value function

I know, I know. It's not supposed to be this way.

Big picture, I'm working with map data and trying to determine which street a bus stop is closest to. A street is made up of a series of points. Some streets have 2 points, but most have about 1/2 a dozen.

My points table looks like this:

CREATE TABLE [dbo].[MapStreetsPoints](
    [FeatureId] [int] NOT NULL,
    [PointNumber] [int] NOT NULL,
    [Latitude] [decimal](8, 5) NOT NULL,
    [Longitude] [decimal](8, 5) NOT NULL,
 CONSTRAINT [PK_MapStreetsPoints] PRIMARY KEY CLUSTERED 
(
    [FeatureId] ASC,
    [PointNumber] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 90) ON [PRIMARY]
) ON [PRIMARY]

Sample data from MapStreetsPoints

FeatureId   PointNumber Latitude   Longitude
----------- ----------- ---------- ----------
1           1           39.81396   -75.83017
1           2           39.81392   -75.83019
1           3           39.81387   -75.83018
1           4           39.81344   -75.83003
1           5           39.81339   -75.83000
1           6           39.81336   -75.82996
1           7           39.81333   -75.82990
1           8           39.81332   -75.82983
1           9           39.81332   -75.82977
1           10          39.81335   -75.82972
2           1           39.72170   -76.11486
2           2           39.72209   -76.11482
2           3           39.72248   -76.11474
2           4           39.72279   -76.11457
2           5           39.72362   -76.11404
2           6           39.72418   -76.11364
2           7           39.72482   -76.11321
2           8           39.72526   -76.11296
2           9           39.72560   -76.11282
2           10          39.72597   -76.11275
2           11          39.72644   -76.11274

There are about 2 million rows in this table.

Basically, I am filtering the rows to something manageable, self-joining the table to itself, applying a slow order by and take the top 1 row. I do this to identify the feature closest to a point.

My inline table value function looks like this.

ALTER FUNCTION dbo.fnMapReverseGeocodeNearestStreet
(   
    @Longitude Decimal(8,5),
    @Latitude Decimal(8,5)
)
RETURNS TABLE 
AS
RETURN 
(
    Select  Top 1 
            A.FeatureId
    From    MapStreetsPoints As A
            Inner Join MapStreetsPoints As B
                On A.FeatureId = B.FeatureId
                And A.PointNumber = B.PointNumber - 1
    Where   A.Longitude Between     Convert(Decimal(8,5), @Longitude - 0.002) 
                                    and Convert(Decimal(8,5), @Longitude + 0.002)
            And A.Latitude Between  Convert(Decimal(8,5), @Latitude - 0.002)
                                    And Convert(Decimal(8,5), @Latitude + 0.002)
            And B.Longitude Between Convert(Decimal(8,5), @Longitude - 0.002) 
                                    and Convert(Decimal(8,5), @Longitude + 0.002)
            And B.Latitude Between  Convert(Decimal(8,5), @Latitude - 0.002)
                                    And Convert(Decimal(8,5), @Latitude + 0.002)
    Order By dbo.PerpendicularDistanceToLine(@Longitude, @Latitude, 
                A.Longitude, 
                A.Latitude, 
                B.Longitude, 
                B.Latitude)
)

The multi-step table valued function looks like this.

Alter FUNCTION dbo.fnMapReverseGeocodeNearestStreet_2
(   
    @Longitude Decimal(8,5),
    @Latitude Decimal(8,5)
)
RETURNS @Output TABLE (FeatureId Int)
AS
BEGIN
    -- Fill the table variable with the rows for your result set
    Declare @Temp 
    Table   (
                FeatureId Int,
                StartLongitude Decimal(8,5),
                StartLatitude Decimal(8,5),
                EndLongitude Decimal(8,5),
                EndLatitude Decimal(8,5)
            );
        
    Insert 
    Into    @Temp(FeatureId, StartLongitude, StartLatitude, EndLongitude, EndLatitude)
    Select  A.FeatureId,
            A.Longitude As StartLongitude,
            A.Latitude As StartLatitude,
            B.Longitude As EndLongitude,
            B.Latitude As EndLatitude
    From    MapStreetsPoints As A
            Inner Join MapStreetsPoints As B
                On A.FeatureId = B.FeatureId
                And A.PointNumber = B.PointNumber - 1
    Where   A.Longitude Between     Convert(Decimal(8,5), @Longitude - 0.002) 
                                    and Convert(Decimal(8,5), @Longitude + 0.002)
            And A.Latitude Between  Convert(Decimal(8,5), @Latitude - 0.002)
                                    And Convert(Decimal(8,5), @Latitude + 0.002)
            And B.Longitude Between Convert(Decimal(8,5), @Longitude - 0.002) 
                                    and Convert(Decimal(8,5), @Longitude + 0.002)
            And B.Latitude Between  Convert(Decimal(8,5), @Latitude - 0.002)
                                    And Convert(Decimal(8,5), @Latitude + 0.002);

        Insert
        Into    @Output(FeatureId)
        Select  Top 1 FeatureId
        From    @Temp T
        Order By dbo.PerpendicularDistanceToLine(@Longitude, @Latitude, T.StartLongitude, T.StartLatitude, T.EndLongitude, T.EndLatitude);


    RETURN 
END

The logic is the same. The difference is that I load a table variable with intermediate results prior to applying the slow-ish order by. It seems to me like the order by is being applied prior to filtering.

To test this, I am using the following query:

Select  Map.FeatureId,
        BusStop.BusStopId1,
        BusStop.Description,
        MapStreets.Hazardous
From    BusStop
        Cross Apply dbo.fnMapReverseGeocodeNearestStreet(BusStop.XCoord,BusStop.YCoord) As Map
        Inner Join MapStreets
            On Map.FeatureId = MapStreets.FeatureId

There's approximately 2,000 rows in the BusStop table.

When I run the test code for the inline table valued function, it takes 22 seconds. With the multi-step table value function, it takes 17 seconds.

XML Version of showplan

The show plan for the itvf:

  |--Nested Loops(Inner Join, OUTER REFERENCES:([A].[FeatureId], [Expr1008]) WITH UNORDERED PREFETCH)
       |--Nested Loops(Inner Join, OUTER REFERENCES:([**DatabaseName**].[dbo].[BusStop].[XCoord], [**DatabaseName**].[dbo].[BusStop].[YCoord]))
       |    |--Index Scan(OBJECT:([**DatabaseName**].[dbo].[BusStop].[idx_BusStop_Description]))
       |    |--Index Spool(SEEK:([**DatabaseName**].[dbo].[BusStop].[XCoord]=[**DatabaseName**].[dbo].[BusStop].[XCoord] AND [**DatabaseName**].[dbo].[BusStop].[YCoord]=[**DatabaseName**].[dbo].[BusStop].[YCoord]))
       |         |--Sort(TOP 1, ORDER BY:([Expr1004] ASC))
       |              |--Compute Scalar(DEFINE:([Expr1004]=[**DatabaseName**].[dbo].[PerpendicularDistanceToLine]([**DatabaseName**].[dbo].[BusStop].[XCoord],[**DatabaseName**].[dbo].[BusStop].[YCoord],[**DatabaseName**].[dbo].[MapStreetsPoints].[Longitude] as [A].[Longitude],[**DatabaseName**].[dbo].[MapStreetsPoints].[Latitude] as [A].[Latitude],[**DatabaseName**].[dbo].[MapStreetsPoints].[Longitude] as [B].[Longitude],[**DatabaseName**].[dbo].[MapStreetsPoints].[Latitude] as [B].[Latitude])))
       |                   |--Hash Match(Inner Join, HASH:([A].[FeatureId], [A].[PointNumber])=([B].[FeatureId], [Expr1007]), RESIDUAL:([**DatabaseName**].[dbo].[MapStreetsPoints].[FeatureId] as [A].[FeatureId]=[**DatabaseName**].[dbo].[MapStreetsPoints].[FeatureId] as [B].[FeatureId] AND [**DatabaseName**].[dbo].[MapStreetsPoints].[PointNumber] as [A].[PointNumber]=[Expr1007]))
       |                        |--Index Seek(OBJECT:([**DatabaseName**].[dbo].[MapStreetsPoints].[MapStreetsPoints4] AS [A]), SEEK:([A].[Longitude] >= CONVERT(decimal(8,5),[**DatabaseName**].[dbo].[BusStop].[XCoord]-(0.002),0) AND [A].[Longitude] <= CONVERT(decimal(8,5),[**DatabaseName**].[dbo].[BusStop].[XCoord]+(0.002),0)),  WHERE:([**DatabaseName**].[dbo].[MapStreetsPoints].[Latitude] as [A].[Latitude]>=CONVERT(decimal(8,5),[**DatabaseName**].[dbo].[BusStop].[YCoord]-(0.002),0) AND [**DatabaseName**].[dbo].[MapStreetsPoints].[Latitude] as [A].[Latitude]<=CONVERT(decimal(8,5),[**DatabaseName**].[dbo].[BusStop].[YCoord]+(0.002),0)) ORDERED FORWARD)
       |                        |--Compute Scalar(DEFINE:([Expr1007]=[**DatabaseName**].[dbo].[MapStreetsPoints].[PointNumber] as [B].[PointNumber]-(1)))
       |                             |--Index Seek(OBJECT:([**DatabaseName**].[dbo].[MapStreetsPoints].[MapStreetsPoints4] AS [B]), SEEK:([B].[Longitude] >= CONVERT(decimal(8,5),[**DatabaseName**].[dbo].[BusStop].[XCoord]-(0.002),0) AND [B].[Longitude] <= CONVERT(decimal(8,5),[**DatabaseName**].[dbo].[BusStop].[XCoord]+(0.002),0)),  WHERE:([**DatabaseName**].[dbo].[MapStreetsPoints].[Latitude] as [B].[Latitude]>=CONVERT(decimal(8,5),[**DatabaseName**].[dbo].[BusStop].[YCoord]-(0.002),0) AND [**DatabaseName**].[dbo].[MapStreetsPoints].[Latitude] as [B].[Latitude]<=CONVERT(decimal(8,5),[**DatabaseName**].[dbo].[BusStop].[YCoord]+(0.002),0)) ORDERED FORWARD)
       |--Clustered Index Seek(OBJECT:([**DatabaseName**].[dbo].[MapStreets].[PK_MapStreets]), SEEK:([**DatabaseName**].[dbo].[MapStreets].[FeatureId]=[**DatabaseName**].[dbo].[MapStreetsPoints].[FeatureId] as [A].[FeatureId]) ORDERED FORWARD)

XML Version of showplan for multi-step version

The Show Plan for the multi step version is:

    |--Table Insert(OBJECT:(@Temp), SET:([FeatureId] = [**DatabaseName**].[dbo].[MapStreetsPoints].[FeatureId] as [A].[FeatureId],[StartLongitude] = [**DatabaseName**].[dbo].[MapStreetsPoints].[Longitude] as [A].[Longitude],[StartLatitude] = [UDSD 2021
         |--Top(ROWCOUNT est 0)
              |--Nested Loops(Inner Join, OUTER REFERENCES:([B].[FeatureId], [Expr1013]))
                   |--Compute Scalar(DEFINE:([Expr1013]=[**DatabaseName**].[dbo].[MapStreetsPoints].[PointNumber] as [B].[PointNumber]-(1)))
                   |    |--Index Seek(OBJECT:([**DatabaseName**].[dbo].[MapStreetsPoints].[MapStreetsPoints4] AS [B]), SEEK:([B].[Longitude] >= CONVERT(decimal(8,5),[@Longitude]-(0.002),0) AND [B].[Longitude] <= CONVERT(decimal(8,5),[@Longitude]+(0.0
                   |--Index Seek(OBJECT:([**DatabaseName**].[dbo].[MapStreetsPoints].[MapStreetsPoints_PointNumber] AS [A]), SEEK:([A].[PointNumber]=[Expr1013] AND [A].[FeatureId]=[**DatabaseName**].[dbo].[MapStreetsPoints].[FeatureId] as [B].[FeatureI


    |--Table Insert(OBJECT:([**DatabaseName**].[dbo].[fnMapReverseGeocodeNearestStreet_2]), SET:([FeatureId] = @Temp.[FeatureId] as [T].[FeatureId]))
         |--Sort(TOP 1, ORDER BY:([Expr1005] ASC))
              |--Compute Scalar(DEFINE:([Expr1005]=[**DatabaseName**].[dbo].[PerpendicularDistanceToLine]([@Longitude],[@Latitude],@Temp.[StartLongitude] as [T].[StartLongitude],@Temp.[StartLatitude] as [T].[StartLatitude],@Temp.[EndLongitude] as [T]
                   |--Table Scan(OBJECT:(@Temp AS [T]))

I can't figure out why the inline table value function is slower. My best guess is that the order by is being applied too soon and is therefore running on too many rows. I should mention that after the filtering is applied, there is usually less than 100 rows for the order by to look at.

PerpendicularDistanceToLine is a multi-statement UDF that returns a scalar. It has no table access, it simply applies a series of math operations on the input.

Upvotes: 4

Views: 608

Answers (1)

Martin Smith
Martin Smith

Reputation: 453746

TLDR;

The function has a predicate on

Longitude BETWEEN XCoord-0.002 AND XCoord+0.002 
  AND Latitude BETWEEN YCoord-0.002 AND YCoord+0.002`

The XCoord and YCoord are passed in from the outer table (BusStop).

For the inline table valued function the plan references these outer columns directly and SQL Server just guesses the cardinality returned based on a hardcoded formula as 6,557.1 per execution and uses a hash join.

For the multistatement TVF these values are passed in as parameter values. This allows it to sniff the values passed in when it is actually first executed and compile a different plan.

The different plan will still only be based on one set of parameter values (not optimised separately for each outer row) but potentially the different plan may be better enough overall in your case that it more than mitigates the overhead of the multistatement TVF itself.

Inline TVF plan enter image description here

Multistatement TVF plan (just function part) enter image description here

My best guess is that the order by is being applied too soon and is therefore running on too many rows.

No this isn't the case. Both the execution plans have the same basic strategy.

  1. For each row in BusStop
    • a) Calculate the result of MapStreetsPoints joined on itself using the correlated parameters from the outer row
    • b) Evaluate the scalar UDF on the joined result and do a Top 1 Sort and return the top row.
  2. Then do a seek against MapStreets for each row returned from step 1

There are other differences too of course.

The inline table valued function version spools the results from step b in case it sees the same correlated parameter values (but presumably every bus stop is in a unique location so this never happens in practice).

The multi statement TVF plan has separate sub trees for 1a and 1b and they are inserted into two different table variables (The TOP operator in 1a here is a "rowcount top" in case the execution plan is used when SET ROWCOUNT is set and does not filter the rows down at all).

Nonetheless the number of Scalar UDF invocations and rows going into the TOP N sort should be the same in both cases as this step is running on the result of step 1a

What is different is the join strategies used in step 1a itself.

Inline TVF plan

  • Seek on Longitude BETWEEN BusStop.XCoord-(0.002) AND BusStop.XCoord+(0.002) with residual predicate on Latitude BETWEEN BusStop.YCoord-(0.002) AND BusStop.YCoord+(0.002).
  • Seek on Longitude BETWEEN BusStop.XCoord-(0.002) AND BusStop.XCoord+(0.002) with residual predicate on Latitude BETWEEN BusStop.YCoord-(0.002) AND BusStop.YCoord+(0.002)
  • Use a hash join to join the two identical seeks on A.FeatureId = B.FeatureId And A.PointNumber = B.PointNumber - 1

The seeks above are using a BETWEEN with values passed in from the outer row. SQL Server guesses each execution will return 16,557.1 rows. I'm not sure what the exact formula is but this estimate is entirely a guess based on table cardinality. I was able to get the exact same guess when I created a table my side and inserted 2,044,080 rows and used the legacy cardinality estimator.

enter image description here

MultiStatement TVF plan

  • This does a Seek on Longitude BETWEEN @Longitude-(0.002) AND @Longitude+(0.002) with residual predicate on Latitude BETWEEN @Latitude-(0.002) AND @Latitude+(0.002).
  • But it does a nested loops and seeks into a different index with an equality condition on PointNumber, FeatureId and then a range on the Latitude, Longitude portion.

The initial seek above is identical to the one in the hash join case. The only difference is whether or not the predicate is expressed as an outer table reference or function parameter but the estimated rows are radically different.

enter image description here

The plan you provided was an estimated plan and was compiled with null parameter values. You should gather the actual plan and check whether or not it is the same.

On the assumption that the plans are the same then the conclusion would be that nested loops is in practice a much better join choice for your example data than hash join. And better enough to overcome the extra baggage for the multistatement TVF.

You should try the following to get the same nested loops plan but without the multistatement baggage and see how that performs.

CREATE FUNCTION dbo.fnMapReverseGeocodeNearestStreet_i2
(   
    @Longitude Decimal(8,5),
    @Latitude Decimal(8,5)
)
RETURNS TABLE 
AS
RETURN 
(
    Select  Top 1 
            A.FeatureId
    From    MapStreetsPoints As A
            Inner LOOP Join MapStreetsPoints As B with (index = [MapStreetsPoints_PointNumber])
                On A.FeatureId = B.FeatureId
                And A.PointNumber + 1 = B.PointNumber
    Where   A.Longitude Between     Convert(Decimal(8,5), @Longitude - 0.002) 
                                    and Convert(Decimal(8,5), @Longitude + 0.002)
            And A.Latitude Between  Convert(Decimal(8,5), @Latitude - 0.002)
                                    And Convert(Decimal(8,5), @Latitude + 0.002)
            And B.Longitude Between Convert(Decimal(8,5), @Longitude - 0.002) 
                                    and Convert(Decimal(8,5), @Longitude + 0.002)
            And B.Latitude Between  Convert(Decimal(8,5), @Latitude - 0.002)
                                    And Convert(Decimal(8,5), @Latitude + 0.002)
    Order By dbo.PerpendicularDistanceToLine(@Longitude, @Latitude, 
                A.Longitude, 
                A.Latitude, 
                B.Longitude, 
                B.Latitude)
)

And you should certainly get rid of the scalar UDF and replace it with an inline TVF to get rid of the overhead associated with non inline scalar UDFs

Upvotes: 2

Related Questions