Reputation: 4809
I need to self-join and display the records based on the match percentage
Name | Village
Jones Ashley, MPK
Meyer Peter, JSK
A Jones, MPK
David, ARK
Peter M, JSK
Peter M, JSK
David, ARK
select
x.Name,
y.Name,
dbo.matchname(x.Name, y.Name) 'match'
from cust x, cust y where dbo.matchname(x.Name, y.Name) >= 80
and x.village = y.village
I have written a function which takes two names and calculates the percentage. For ex: Peter M and Meyer Peter is 80%.
I would now like to display the related records order by the match percentage. For ex:
Jones Ashley, MPK
A Jones, MPK
David, ARK
David, ARK
Peter M, JSK
Peter M, JSK
Meyer Peter, JSK
Order by doesn't work here as the initial could be at the last for some names. I need some kind of grouping but not sure how to do.
Upvotes: 0
Views: 156
Reputation: 7928
I don't know what your matchname scalar function does so I just created a generic scalar function that compares two strings and returns a small number.
-- (0) Prep: a matchname function
IF OBJECT_ID('tempdb.dbo.matchname') IS NOT NULL DROP FUNCTION dbo.matchname;
GO
CREATE FUNCTION dbo.matchname(@string1 varchar(40), @string2 varchar(40))
RETURNS int AS
BEGIN RETURN((ABS(ASCII(@string1)+3) - (ASCII(@string2))))*7 END;
Below is some sample data and a solution. The most important thing to note is how I filtered my CROSS JOIN:
WHERE x.someid < y.someid
Doing this prevents you from evaluating the same records twice; e.g. dbo.matchname(John Smith, George Washington) & dbo.matchname(George Washington, John Smith.
Sample data and solution
-- Sample data
DECLARE @yourtable TABLE
(
someid int identity primary key clustered,
[Name] varchar(40),
Village varchar(10)
,index nc_yt nonclustered([Name] ASC)
);
INSERT @yourtable ([Name], Village)
VALUES
('Jones Ashley', 'MPK'),
('Meyer Peter', 'JSK'),
('A Jones', 'MPK'),
('David', 'ARK'),
('Peter M', 'JSK'),
('Peter M', 'JSK'),
('David', 'ARK');
-- Solution
WITH uniqueList AS
(
select
rn = ROW_NUMBER() OVER
(
PARTITION BY x.Name, y.Name, dbo.matchname(x.name, y.name)
ORDER BY (SELECT NULL)
),
Name1 = x.Name,
Name2 = y.Name,
id1 = x.someid, id2 = y.someid,
dbo.matchname(x.name, y.name) AS match
from @yourtable x
CROSS JOIN @yourtable y
WHERE x.someid < y.someid
AND dbo.matchname(x.Name, y.Name) >= 80
)
SELECT Name1, Name2, match
FROM uniqueList
WHERE rn = 1
ORDER BY match;
Now about that scalar valued function... Scalar Valued User Defined Functions (scalar UDF for short) KILL performance, especially how you're using yours!. You can replace scalar UDFs with inline table valued functions (iTVF) for optimal performance.
First the new function:
IF OBJECT_ID('tempdb.dbo.itvf_matchname') IS NOT NULL DROP FUNCTION dbo.itvf_matchname;
GO
CREATE FUNCTION dbo.itvf_matchname(@string1 varchar(40), @string2 varchar(40))
RETURNS TABLE WITH SCHEMABINDING AS
RETURN(SELECT match = (ABS(ASCII(@string1)+3) - (ASCII(@string2)))*7);
Now a solution (note that I commented out the original scalar UDF code):
-- sample data
DECLARE @yourtable TABLE
(
someid int identity primary key clustered,
[Name] varchar(40),
Village varchar(10)
,index nc_yt nonclustered([Name] ASC)
);
INSERT @yourtable ([Name], Village)
VALUES
('Jones Ashley', 'MPK'),
('Meyer Peter', 'JSK'),
('A Jones', 'MPK'),
('David', 'ARK'),
('Peter M', 'JSK'),
('Peter M', 'JSK'),
('David', 'ARK');
-- solution
WITH uniqueList AS
(
select
rn = ROW_NUMBER() OVER
(
PARTITION BY x.Name, y.Name, /*dbo.matchname(x.name, y.name)*/ itvf.match
ORDER BY (SELECT NULL)
),
Name1 = x.Name,
Name2 = y.Name,
--dbo.matchname(x.name, y.name) AS match
itvf.match
from @yourtable x
CROSS JOIN @yourtable y
-- Below: only 1 function call with results referenced multiple times
CROSS APPLY dbo.itvf_matchname(x.Name, y.Name) itvf
WHERE x.someid < y.someid
--AND dbo.matchname(x.Name, y.Name) >= 80
AND itvf.match >= 80
)
SELECT Name1, Name2, match
FROM uniqueList
WHERE rn = 1;
The results are identical but the performance is profoundly better. To better understand why you should replace scalar UDFs with iTVFs let's do a 1500 row test (which means we'll be evaluating 1 million+ rows):
-- (3.1) Sample Data with an ID
SET NOCOUNT ON;
IF OBJECT_ID('tempdb..#yourtable') IS NOT NULL DROP TABLE #yourtable;
CREATE TABLE #yourtable
(
someid int identity primary key clustered,
[Name] varchar(40) NOT NULL,
Village varchar(10) NOT NULL
);
INSERT #yourtable
SELECT TOP (1500) LEFT(REPLACE(newid(),'-',''),10), 'xxx'
FROM sys.all_columns a
CROSS JOIN sys.all_columns b;
GO
CREATE NONCLUSTERED INDEX nc_yt ON #yourTable([Name] ASC);
GO
PRINT 'Scalar function'+char(13)+char(10)+REPLICATE('-',50);
GO
DECLARE @x bit, @st datetime2 = getdate();
WITH uniqueList AS
(
select
rn = ROW_NUMBER() OVER
(
PARTITION BY x.Name, y.Name, dbo.matchname(x.name, y.name)
ORDER BY (SELECT NULL)
),
Name1 = x.Name,
Name2 = y.Name,
dbo.matchname(x.name, y.name) AS match
from #yourtable x
CROSS JOIN #yourtable y
WHERE x.someid < y.someid
AND dbo.matchname(x.Name, y.Name) >= 80
)
SELECT @x = 1
FROM uniqueList
WHERE rn = 1;
PRINT DATEDIFF(MS, @st, getdate());
GO 5
PRINT char(13)+char(10)+'ITVF (serial)'+char(13)+char(10)+REPLICATE('-',50);
GO
DECLARE @x bit, @st datetime2 = getdate();
WITH uniqueList AS
(
select
rn = ROW_NUMBER() OVER
(
PARTITION BY x.Name, y.Name, /*dbo.matchname(x.name, y.name)*/ itvf.match
ORDER BY (SELECT NULL)
),
Name1 = x.Name,
Name2 = y.Name,
--dbo.matchname(x.name, y.name) AS match
itvf.match
from #yourtable x
CROSS JOIN #yourtable y
-- Below: only 1 function call with results referenced multiple times
CROSS APPLY dbo.itvf_matchname(x.Name, y.Name) itvf
WHERE x.someid < y.someid
--AND dbo.matchname(x.Name, y.Name) >= 80
AND itvf.match >= 80
)
SELECT @x = 1
FROM uniqueList
WHERE rn = 1
OPTION (MAXDOP 1);
PRINT DATEDIFF(MS, @st, getdate());
GO 5
PRINT char(13)+char(10)+'ITVF Parallel'+char(13)+char(10)+REPLICATE('-',50);
GO
DECLARE @x bit, @st datetime2 = getdate();
WITH uniqueList AS
(
select
rn = ROW_NUMBER() OVER
(
PARTITION BY x.Name, y.Name, /*dbo.matchname(x.name, y.name)*/ itvf.match
ORDER BY (SELECT NULL)
),
Name1 = x.Name,
Name2 = y.Name,
--dbo.matchname(x.name, y.name) AS match
itvf.match
from #yourtable x
CROSS JOIN #yourtable y
-- Below: only 1 function call with results referenced multiple times
CROSS APPLY dbo.itvf_matchname(x.Name, y.Name) itvf
CROSS APPLY dbo.make_parallel()
WHERE x.someid < y.someid
--AND dbo.matchname(x.Name, y.Name) >= 80
AND itvf.match >= 80
)
SELECT @x = 1
FROM uniqueList
WHERE rn = 1;
PRINT DATEDIFF(MS, @st, getdate());
GO 5
And the results:
Scalar function
--------------------------------------------------
Beginning execution loop
4627
4504
4440
4457
4550
Batch execution completed 5 times.
ITVF (serial)
--------------------------------------------------
Beginning execution loop
1623
1610
1643
1640
1713
Batch execution completed 5 times.
ITVF Parallel
--------------------------------------------------
Beginning execution loop
1306
1067
1077
1127
1047
Batch execution completed 5 times.
The iTVF based solution was ~3X faster when run with a serial plan, 4X faster with a parallel plan.
Upvotes: 1