SQL Server: similar records based on 2 detail tables

Question

I need to find similar records in SQL Server 2016, and it would be great if I could order result by how much they match. Match criteria for similar are 2 detail tables.

My table structures are:

//Basic:
Id      Title
1       Title 1
2       Title 2
3       Title 3
4       Title 4
5       Title 5

//Parameters:
Id      Title
1       Param 1
2       Param 2
3       Param 3
4       Param 4
5       Param 5

//Values:
Id      Value
1       Val 1
2       Val 2
3       Val 3
4       Val 4
5       Val 5

And connection tables:

BasicId     ParameterId     ValueId
1           1               2
1           3               1
1           4               5
2           1               1
2           2               4
2           3               2
3           1               2
3           3               1
3           4               4
4           1               2
4           4               2
4           5               2
5           1               1
5           2               5
5           3               3

At some point in my program I create new Basic row (in memory) with Parameters and values. Now I need to know:

Is there any record in database which match exactly the same parameters and values.
Similar records ordered by how many parameters and values matches.

For first point I do it like this: 2 rows are equal if all parameters and there values matches. For now I calculate string with Ids of all parameters and values and add it to Basic table.

Basic becomes:

Id          Title           EqualString (this comes from connection table)
1           Title 1         1:2;3:1;4:5     
2           Title 2         1:1;2:4;3:2
3           Title 3         1:2;3:1;4:4
4           Title 4         1:2;4:2;5:2
5           Title 5         1:1;2:5;3:3

This works great, since order is defined by ParameterId ASC and I have index on EqualString column. At this point I should say, that my basic table have arround 5 million records and growing.

But for second problem, I don't have a clue, how this kind of problem could be solved.

For example:

Input Title X1    1:1;2:4;3:2       100%        Title 2
                                    33%         Title 5

Input Title X2    1:1;2:4;3:1       66%         Title 2
                                    33%         Title 5

Input Title X3    1:2;3:1;5:5       66%         Title 1
                                    66%         Title 3

Input Title X4    1:2;3:1;5:3       66%         Title 3
                                    33%         Title 5

I use SQL Server 2016 and have complete control of computer on which database is installed.
There are around 100 parameters-values to check if they match (in upper example only 3 - they are separated with ;)

I think full text search is not the answer (or maybe it is), because I need exact match between 2 commas (2:5 for example).

Or I should go with completely different approach without calculating EqualString (on insert witch is rarely).

Use case:

I am building web page where product that sell is dynamic. That means, that user configure it before it buys.
For example: set height, width, colour, choose material ... Now I need to know if this product already exists (someone else choose exact same parameters (width, height, colour) and values (100, 120, green) or this is completely new product.
Parameters which was configured are in parameters table and values are in values table.

Alex · Accepted Answer

Test data

This is copied verbatim from the question:

CREATE TABLE #Connection( BasicId INT, ParameterId INT, ValueId INT )
INSERT INTO #Connection
VALUES
( 1, 1, 2 ), ( 1, 3, 1 ), ( 1, 4, 5 ),
( 2, 1, 1 ), ( 2, 2, 4 ), ( 2, 3, 2 ),
( 3, 1, 2 ), ( 3, 3, 1 ), ( 3, 4, 4 ),
( 4, 1, 2 ), ( 4, 4, 2 ), ( 4, 5, 2 ),
( 5, 1, 1 ), ( 5, 2, 5 ), ( 5, 3, 3 )

This is sample data for new products

CREATE TABLE #NewConnection( BasicId INT, ParameterId INT, ValueId INT )
INSERT INTO #NewConnection
VALUES
( 10, 1, 2 ), ( 10, 2, 2 ), ( 10, 3, 1 ),
( 11, 1, 1 ), ( 11, 2, 4 ), ( 11, 3, 2 ),
( 12, 1, 0 ), ( 12, 2, 5 ), ( 12, 3, 3 )

The query

SELECT NC.BasicID AS NewBasicID, C.BasicID, C.ParameterId, C.ValueId,
    CONVERT( DECIMAL( 5, 0 ), COUNT( C.ParameterId ) OVER( PARTITION BY NC.BasicID, C.BasicID )) / ParamCount * 100 AS MatchStrength
FROM
        ( SELECT *, COUNT( ParameterId ) OVER( PARTITION BY BasicID ) AS ParamCount
        FROM #NewConnection ) AS NC
    INNER JOIN #Connection AS C
            ON NC.ParameterId = C.ParameterId AND C.ValueId = NC.ValueId
ORDER BY NewBasicID, MatchStrength DESC, C.BasicID, C.ParameterId

How it works

#NewConnection - contains "New Connection" data that needs to be checked against existing #Connection table
ParamCount - counts the number of properties for each "new" BasicID. Needed to calculate MatchStrength
INNER JOIN - joins "new" connections with existing ones on ParameterIDs and ValueIDs
MatchStrength - counts matched ParameterId and ValueID combinations for each NC.BasicID, C.BasicID group, then divides by the ParamCount to get the percentage matched.

References:

Note: performance for large data sets has not been tested but you may need the following indexes: { parameterID, valueID }, { BasicID }

Query Variations

If you want to return distinct matched BasicIDs and MatchStrength (no parameterIDs) then use the following:

SELECT DISTINCT NewBasicID, BasicID, MatchStrength
FROM
    ( SELECT NC.BasicID AS NewBasicID, C.BasicID, C.ParameterId, C.ValueId,
        CONVERT( DECIMAL( 5, 0 ), COUNT( C.ParameterId ) OVER( PARTITION BY NC.BasicID, C.BasicID )) / ParamCount * 100 AS MatchStrength
    FROM
            ( SELECT *, COUNT( ParameterId ) OVER( PARTITION BY BasicID ) AS ParamCount
            FROM #NewConnection ) AS NC
        INNER JOIN #Connection AS C
                ON NC.ParameterId = C.ParameterId AND C.ValueId = NC.ValueId ) AS Matches
ORDER BY NewBasicID, MatchStrength DESC, BasicID

SQL Server: similar records based on 2 detail tables

Answers (1)

Test data

The query

How it works

Query Variations

Related Questions