Grouping based on the match percentage

Question

I need to self-join and display the records based on the match percentage

Name | Village
Jones Ashley, MPK
Meyer Peter, JSK
A Jones, MPK
David, ARK
Peter M, JSK
Peter M, JSK
David, ARK

select 
x.Name, 
y.Name,
dbo.matchname(x.Name, y.Name) 'match'  
from cust x, cust y where dbo.matchname(x.Name, y.Name) >= 80
and x.village = y.village

I have written a function which takes two names and calculates the percentage. For ex: Peter M and Meyer Peter is 80%.

I would now like to display the related records order by the match percentage. For ex:

Jones Ashley, MPK
A Jones, MPK
David, ARK 
David, ARK 
Peter M, JSK
Peter M, JSK
Meyer Peter, JSK

Order by doesn't work here as the initial could be at the last for some names. I need some kind of grouping but not sure how to do.

Alan Burstein · Accepted Answer

I don't know what your matchname scalar function does so I just created a generic scalar function that compares two strings and returns a small number.

-- (0) Prep: a matchname function
IF OBJECT_ID('tempdb.dbo.matchname') IS NOT NULL DROP FUNCTION dbo.matchname;
GO
CREATE FUNCTION dbo.matchname(@string1 varchar(40), @string2 varchar(40))
RETURNS int AS
BEGIN RETURN((ABS(ASCII(@string1)+3) - (ASCII(@string2))))*7 END;

Below is some sample data and a solution. The most important thing to note is how I filtered my CROSS JOIN:

WHERE x.someid < y.someid

Doing this prevents you from evaluating the same records twice; e.g. dbo.matchname(John Smith, George Washington) & dbo.matchname(George Washington, John Smith.

Sample data and solution

-- Sample data
DECLARE @yourtable TABLE 
(
  someid int identity primary key clustered, 
  [Name] varchar(40), 
  Village varchar(10)
  ,index nc_yt nonclustered([Name] ASC)
);
INSERT @yourtable ([Name], Village) 
VALUES
('Jones Ashley', 'MPK'),
('Meyer Peter', 'JSK'),
('A Jones', 'MPK'),
('David', 'ARK'),
('Peter M', 'JSK'),
('Peter M', 'JSK'),
('David', 'ARK');

-- Solution
WITH uniqueList AS
(
  select 
    rn = ROW_NUMBER() OVER 
        (
          PARTITION BY x.Name, y.Name, dbo.matchname(x.name, y.name) 
          ORDER BY (SELECT NULL)
        ),
    Name1 = x.Name,
    Name2 = y.Name,
    id1 = x.someid, id2 = y.someid,
    dbo.matchname(x.name, y.name) AS match
  from @yourtable x 
  CROSS JOIN @yourtable y
  WHERE x.someid < y.someid 
  AND dbo.matchname(x.Name, y.Name) >= 80
)
SELECT Name1, Name2, match
FROM uniqueList
WHERE rn = 1
ORDER BY match;

Now about that scalar valued function... Scalar Valued User Defined Functions (scalar UDF for short) KILL performance, especially how you're using yours!. You can replace scalar UDFs with inline table valued functions (iTVF) for optimal performance.

First the new function:

IF OBJECT_ID('tempdb.dbo.itvf_matchname') IS NOT NULL DROP FUNCTION dbo.itvf_matchname;
GO
CREATE FUNCTION dbo.itvf_matchname(@string1 varchar(40), @string2 varchar(40))
RETURNS TABLE WITH SCHEMABINDING AS
RETURN(SELECT match = (ABS(ASCII(@string1)+3) - (ASCII(@string2)))*7);

Now a solution (note that I commented out the original scalar UDF code):

-- sample data
DECLARE @yourtable TABLE 
(
  someid int identity primary key clustered, 
  [Name] varchar(40), 
  Village varchar(10)
  ,index nc_yt nonclustered([Name] ASC)
);
INSERT @yourtable ([Name], Village) 
VALUES
('Jones Ashley', 'MPK'),
('Meyer Peter', 'JSK'),
('A Jones', 'MPK'),
('David', 'ARK'),
('Peter M', 'JSK'),
('Peter M', 'JSK'),
('David', 'ARK');

-- solution
WITH uniqueList AS
(
  select 
    rn = ROW_NUMBER() OVER 
        (
          PARTITION BY x.Name, y.Name, /*dbo.matchname(x.name, y.name)*/ itvf.match 
          ORDER BY (SELECT NULL)
        ),
    Name1 = x.Name,
    Name2 = y.Name,
    --dbo.matchname(x.name, y.name) AS match
    itvf.match
  from @yourtable x 
  CROSS JOIN @yourtable y
  -- Below: only 1 function call with results referenced multiple times
  CROSS APPLY dbo.itvf_matchname(x.Name, y.Name) itvf
  WHERE x.someid < y.someid 
  --AND dbo.matchname(x.Name, y.Name) >= 80
  AND itvf.match >= 80
)
SELECT Name1, Name2, match
FROM uniqueList
WHERE rn = 1;

The results are identical but the performance is profoundly better. To better understand why you should replace scalar UDFs with iTVFs let's do a 1500 row test (which means we'll be evaluating 1 million+ rows):

-- (3.1) Sample Data with an ID
SET NOCOUNT ON;
IF OBJECT_ID('tempdb..#yourtable') IS NOT NULL DROP TABLE #yourtable;

CREATE TABLE #yourtable 
(
  someid int identity primary key clustered, 
  [Name] varchar(40)  NOT NULL, 
  Village varchar(10) NOT NULL
);
INSERT #yourtable
SELECT TOP (1500) LEFT(REPLACE(newid(),'-',''),10), 'xxx'
FROM sys.all_columns a 
CROSS JOIN sys.all_columns b;
GO
CREATE NONCLUSTERED INDEX nc_yt ON #yourTable([Name] ASC);
GO


PRINT 'Scalar function'+char(13)+char(10)+REPLICATE('-',50);
GO
DECLARE @x bit, @st datetime2 = getdate();
WITH uniqueList AS
(
  select 
    rn = ROW_NUMBER() OVER 
        (
          PARTITION BY x.Name, y.Name, dbo.matchname(x.name, y.name) 
          ORDER BY (SELECT NULL)
        ),
    Name1 = x.Name,
    Name2 = y.Name,
    dbo.matchname(x.name, y.name) AS match
  from #yourtable x 
  CROSS JOIN #yourtable y
  WHERE x.someid < y.someid 
  AND dbo.matchname(x.Name, y.Name) >= 80
)
SELECT @x = 1
FROM uniqueList
WHERE rn = 1;
PRINT DATEDIFF(MS, @st, getdate());
GO 5

PRINT char(13)+char(10)+'ITVF (serial)'+char(13)+char(10)+REPLICATE('-',50);
GO
DECLARE @x bit, @st datetime2 = getdate();
WITH uniqueList AS
(
  select 
    rn = ROW_NUMBER() OVER 
        (
          PARTITION BY x.Name, y.Name, /*dbo.matchname(x.name, y.name)*/ itvf.match 
          ORDER BY (SELECT NULL)
        ),
    Name1 = x.Name,
    Name2 = y.Name,
    --dbo.matchname(x.name, y.name) AS match
    itvf.match
  from #yourtable x 
  CROSS JOIN #yourtable y
  -- Below: only 1 function call with results referenced multiple times
  CROSS APPLY dbo.itvf_matchname(x.Name, y.Name) itvf
  WHERE x.someid < y.someid 
  --AND dbo.matchname(x.Name, y.Name) >= 80
  AND itvf.match >= 80
)
SELECT @x = 1
FROM uniqueList
WHERE rn = 1
OPTION (MAXDOP 1);
PRINT DATEDIFF(MS, @st, getdate());
GO 5

PRINT char(13)+char(10)+'ITVF Parallel'+char(13)+char(10)+REPLICATE('-',50);
GO
DECLARE @x bit, @st datetime2 = getdate();
WITH uniqueList AS
(
  select 
    rn = ROW_NUMBER() OVER 
        (
          PARTITION BY x.Name, y.Name, /*dbo.matchname(x.name, y.name)*/ itvf.match 
          ORDER BY (SELECT NULL)
        ),
    Name1 = x.Name,
    Name2 = y.Name,
    --dbo.matchname(x.name, y.name) AS match
    itvf.match
  from #yourtable x 
  CROSS JOIN #yourtable y
  -- Below: only 1 function call with results referenced multiple times
  CROSS APPLY dbo.itvf_matchname(x.Name, y.Name) itvf
  CROSS APPLY dbo.make_parallel()
  WHERE x.someid < y.someid 
  --AND dbo.matchname(x.Name, y.Name) >= 80
  AND itvf.match >= 80
)
SELECT @x = 1
FROM uniqueList
WHERE rn = 1;
PRINT DATEDIFF(MS, @st, getdate());
GO 5

And the results:

Scalar function
--------------------------------------------------
Beginning execution loop
4627
4504
4440
4457
4550
Batch execution completed 5 times.

ITVF (serial)
--------------------------------------------------
Beginning execution loop
1623
1610
1643
1640
1713
Batch execution completed 5 times.

ITVF Parallel
--------------------------------------------------
Beginning execution loop
1306
1067
1077
1127
1047
Batch execution completed 5 times.

The iTVF based solution was ~3X faster when run with a serial plan, 4X faster with a parallel plan.

Grouping based on the match percentage

Answers (1)

Related Questions