Reputation: 682
I would like to ask for help regarding how to do a partial string comparison of two address values stored in two separate columns. I need to only identify the values that match even if the match is not 100%. I do not care for the value for the street direction: N, S, E, W, or whether it is a suite, apartment, nor the type of street (rd, st, dr, cr, etc). Perhaps the requirement is fulfilled by only matching the first value, the house number.
An example would be:
Column1 Column2
17 Wickham CT 17 S WICKHAM CT # 2 << This is a partial match, include
6818 Chester DR 6801 CHESTER DR # A << This is a partial match, include
6301 Raymond RD 6301 RAYMOND RD << This is a full match, include
6217 Raymond RD PO BOX 45581 << This doesn't match, don't include
I have the query that retrieves everything, I just need to figure out how to filter the records that I don't need.
If I could get any help on this, it would be awesome.
Thank you.
UPDATE
I think I am understanding the data a bit better. I hope I can get help for the following context.
OwnerID Addr_Nbr Address_A Address_B
3336223 2204138 11 Westbrook CIR 11 WESTBROOK CIR
3336223 2431628 9 Westbrook CIR 11 WESTBROOK CIR
3337465 2328720 4214 School RD 4214 SCHOOL RD
In that scenario, from the first two records the first one seem to be a match, but it is not, because that owner has more than one address, the third record is a good match because the address matches and the owner only has one address. How can I select only the records where the address matches, and there is only one address number for the owner? I hope this makes sense.
Thank you again
Upvotes: 0
Views: 1312
Reputation: 351
You can also try using Pearson's correlation coefficient
Perfect example for MS SQL is here.
Here is my quick assembled example:
--helper function to convert string to ASCII column
CREATE FUNCTION dbo.fn_StringToASCII
(
@text VARCHAR(MAX)
)
RETURNS @Result TABLE
(
POS INT,
NUM INT
)
AS
BEGIN
DECLARE @i INT
SET @i = 1
WHILE @i <= LEN(@text)
BEGIN
INSERT INTO @Result
(
POS,
NUM
)
VALUES
(@i, ASCII(SUBSTRING(@text, @i, 1)))
SET @i = @i + 1
END
RETURN;
END;
-- test example
CREATE TABLE test1(ID INT, ADDR1 VARCHAR(20));
CREATE TABLE test2(ID INT, ADDR2 VARCHAR(20));
INSERT INTO dbo.test1
(
ID,
ADDR1
)
VALUES
(1, '17 Wickham CT'),
(2, '6818 Chester DR'),
(3, '6217 Raymond RD'),
(4, 'TEST');
INSERT INTO dbo.test2
(
ID,
ADDR2
)
VALUES
(1, '17 S WICKHAM CT # 2'),
(2, '6801 CHESTER DR # A'),
(3, 'PO BOX 45581'),
(4, 'TEST');
--query with coeff
SELECT ISNULL(t1.ID, c2.ID) AS ID,
(AVG(c1.NUM * c2.NUM) - (AVG(c1.NUM) * AVG(c2.NUM))) / (STDEVP(c1.NUM) * STDEVP(c2.NUM)) AS Coeff
FROM dbo.test1 t1
CROSS APPLY dbo.fn_StringToASCII(LOWER(t1.ADDR1)) c1
RIGHT JOIN
(
SELECT t2.ID,
c2.*
FROM dbo.test2 t2
CROSS APPLY dbo.fn_StringToASCII(LOWER(t2.ADDR2)) c2
) c2
ON c2.ID = t1.ID
AND c2.POS = c1.POS
WHERE 1 = 1
GROUP BY ISNULL(t1.ID, c2.ID);
DROP TABLE dbo.test1
DROP TABLE dbo.test2
/*
results
ID Coeff
1 0.957280794307261
2 1.58310202187124
3 -0.397204343866094
4 0.987654320987654
*/
Upvotes: 0
Reputation: 61
Try using Levenshtein distance. This link gives a few examples of how to implement it using T-SQL.
Levenshtein distance basically gives you a value for the differences between two strings. You can come up with an acceptable value for your query and return any that hits that acceptable threshold.
Upvotes: 1