Reputation: 39
I have these 2 tables:
CREATE TABLE [dbo].[Customer]
(
[CustomerName] VARCHAR(20) NOT NULL,
[CustomerLink] VARCHAR(40) NULL
)
CREATE TABLE [dbo].[CustomerIdentification]
(
[CustomerName] VARCHAR(20) NOT NULL,
[ID] VARCHAR(50) NOT NULL,
[IDType] VARCHAR(16) NOT NULL
)
And I've added some test data..
INSERT [dbo].[Customer]
([CustomerName])
VALUES ('Fred'),
('Bob'),
('Vince'),
('Tom'),
('Alice')
INSERT [dbo].[CustomerIdentification]
VALUES
('Fred', 'A', 'Passport'),
('Fred', 'A', 'SIN'),
('Fred', 'A', 'Drivers Licence'),
('Bob', 'A', 'Passport'),
('Bob', 'B', 'Drivers Licence'),
('Bob', 'C', 'Credit Card'),
('Vince', 'A', 'Passport'),
('Vince', 'B', 'SIN'),
('Vince', 'C', 'Credit Card'),
('Tom', 'A', 'Passport'),
('Tom', 'B', 'SIN'),
('Tom', 'A', 'Drivers Licence'),
('Alice', 'B', 'Drivers Licence')
Basically, a customer (from customer table) can have many pieces of identifications. For instance, Fred has passport, sin and drivers license and all 3 has value is A (the value could be different too).
This is the output I want:
Test 1: Passport with value A
From the table: first common identification is Passport with value A: 4 Customers have this ID. Fred, Bob, Vince and Tom and maybe they are all the same customer, if thats the case, we would like to link all 4 of them with a unique ID (guid) in customer table's CustomerLink field.
However, if 1 ID matches between customers, then others should match too if any of them have other common IDType. For example, Fred also has SIN with a value of A which also exist on Vince and Tom but with a value of B. So this group is not the same customer. No linking done.
Test 2: SIN with value B
Next common identification is SIN with value B which exist for Vince and Tom. Vince and Tom are indeed same customer since they also have Passport with same value (A). The 3rd identification for both have different IDType, credit card and drivers license. Hence they can be linked. So we link vince and Tom as same customer in customer table.
Test 3: Drivers license with value A
Which exists for Fred and Tom. Both have passport with value A too. Both have SIN common as well, however, the value of SIN is different for both. Its A for Fred and B for Tom. Hence they are not the same customer. No linking done.
Test 4: Drivers license with value B
Which exists for Bob and Alice. They are indeed the same customer since Bob and Alice both have Drivers license with value B. Bob has 2 other pieces of ID which alice does not have, and that is fine since those Ids are unique to Bob. So we link Bob and Alice as same customer in customer table.
Test 5: Credit Card with value C
Which exists for Bob and Vince.
But Bob is already linked to Alice so we have to bring Alice into the picture. And Vince is already linked to Tom so we have to bring Tom into the picture.
Now, Bob has Drivers License with value B and Tom also has Drivers license with value A. Different values but same ID (drivers license) cause this group to not link. Hence, no linking done, previous linkages remain.
In the end, we are left with Vince and Tom, AND, Bob and Alice being linked as same customers in customer table. So customer table may look like ...
CustomerName Customer Link
------------ -------------
Fred NULL
Bob YYYYYY
Vince XXXXXX
Tom XXXXXX
Alice YYYYYY
Upvotes: 0
Views: 180
Reputation: 57053
I had a second look but didn't get far…
First, fix the design from EAV to something 'more relational' (includes nulls but never mind!):
CREATE TABLE dbo.Customers
(
CustomerName VARCHAR(20) NOT NULL,
Passport CHAR(1),
SIN CHAR(1),
Drivers_Licence CHAR(1),
Credit_Card CHAR(1)
) ;
INSERT INTO dbo.Customers (CustomerName, Passport, SIN, Drivers_Licence, Credit_Card)
VALUES ('Fred', 'A', 'A', 'A', NULL),
('Bob', 'A', NULL, 'B', 'C'),
('Vince', 'A', 'B', NULL, 'C'),
('Tom', 'A', 'B', 'A', NULL),
('Alice', NULL, NULL, 'B', NULL);
Next, self-join to find pairs of customers that have at least one matching attribute:
SELECT c1.CustomerName, c2.CustomerName
FROM dbo.Customers c1
JOIN dbo.Customers c2
ON c1.CustomerName < c2.CustomerName
AND (
c1.Credit_Card = c2.Credit_Card
OR c1.Drivers_Licence = c2.Drivers_Licence
OR c1.Passport = c2.Passport
OR c1.SIN = c2.SIN
);
Whittle this down further by customer pairs that don't have contradictory data, giving 'benefit of the doubt' for nulls:
SELECT c1.CustomerName, c2.CustomerName
FROM dbo.Customers c1
JOIN dbo.Customers c2
ON c1.CustomerName < c2.CustomerName
AND (
c1.Credit_Card = c2.Credit_Card
OR c1.Drivers_Licence = c2.Drivers_Licence
OR c1.Passport = c2.Passport
OR c1.SIN = c2.SIN
)
INTERSECT
SELECT c1.CustomerName, c2.CustomerName
FROM dbo.Customers c1
JOIN dbo.Customers c2
ON c1.CustomerName < c2.CustomerName
AND COALESCE(c1.Credit_Card, c2.Credit_Card, 'x')
= COALESCE(c2.Credit_Card, c1.Credit_Card, 'x')
AND COALESCE(c1.Drivers_Licence, c2.Drivers_Licence, 'x')
= COALESCE(c2.Drivers_Licence, c1.Drivers_Licence, 'x')
AND COALESCE(c1.Passport, c2.Passport, 'x')
= COALESCE(c2.Passport, c1.Passport, 'x')
AND COALESCE(c1.SIN, c2.SIN, 'x')
= COALESCE(c2.SIN, c1.SIN, 'x');
This results in three pairs:
CustomerName CustomerName
-------------------- --------------------
Alice Bob
Bob Vince
Tom Vince
Next, we need to eliminate the pair Bob and Vince based on the logic, "Bob is already linked to Alice so we have to bring Alice into the picture. And Vince is already linked to Tom so we have to bring Tom into the picture," which suggests recursion or hierarchical (finding trees in forests?), at which point I bow out.
Upvotes: 0
Reputation: 115600
I have a question. What should happen if we had these rows:
INSERT [dbo].[CustomerIdentification]
VALUES
('Morheus', 'A', 'Passport'),
('Morheus', 'B', 'SIN'),
('Neo', 'B', 'SIN'),
('Neo', 'C', 'Drivers Licence'),
('Trinity', 'C', 'Drivers Licence'),
('Trinity', 'A', 'Passport') ;
Should they all 3 be placed in the same group (regarded as same customer)?
EDIT: OP replied that all 3 should be considered as the same customer.
And what should happen in this occasion:
INSERT [dbo].[CustomerIdentification]
VALUES
('Morheus', 'A', 'Passport'),
('Morheus', 'B', 'SIN'),
('Neo', 'B', 'SIN'),
('Neo', 'C', 'Drivers Licence'),
('Trinity', 'C', 'Drivers Licence'),
('Trinity', 'D', 'Passport') ; --- the only change from previous
Now, Morheus
should be the same customer as Neo
(same SIN
, no other conflicts).
And Neo
should be the same customer as Trinity
(same Drivers Licence
, no other conflicts).
But Trinity
and Morheus
have different passports.
EDIT: OP replied that all if we examine Morheus
and Neo
first, then they should be considered as one. Then, Trinity
should be rejected as her Passport clashes with Morheus
.
I assume that if we examine Neo
and Trinity
first, then they should be considered as one, and then Morheus
should be rejected, as his Passport clashes with Trinity
.
My conclusion is that the question is ill-defined in relational terms. It's not solvable using only relational logic. It probably can be solved using variables or analytic extensions e.g. ROW_NUMBER()
.
Upvotes: 1
Reputation: 57053
The relational operator you require is division, popularly known as "the supplier who supplies all parts".
Because you consider Bob who has three rows to be the same as Alice who has only one row, you should be be looking at division with remainder. Unlike the examples at that link, you need to derive your divisor table from your dividend table CustomerIdentification
. An empty divisor is not an issue in your case.
Upvotes: 1