Reputation: 2381
I am looking for some built-in function which is able to find out in how many character two string differ.
For example:
CharDiff('SO0035F', 'SO005F') = 1
CharDiff('12345', '34512') = 0
Upvotes: 4
Views: 6070
Reputation: 453908
A more set based version of Scott's answer is
DECLARE @String1 VARCHAR(10) = '1234',
@String2 VARCHAR(10) = '1243'
SELECT COUNT(*)
FROM GENERATE_SERIES(1, GREATEST(LEN(@String1 + 'X'), LEN(@String2 + 'X')) - 1)
WHERE SUBSTRING(@String1, value, 1) <> SUBSTRING(@String2, value, 1)
This requires SQL Server 2022+
The same approach of splitting the string into characters with GENERATE_SERIES
can also be used as a first step for different definitions of CharDiff
Upvotes: 0
Reputation: 1
In line with the answer of Scott R. Frost. This solution shows a comparison of '1234' to '1243' as 2 characters different even though they contain the same characters.
For strings up to 8 characters long:
SELECT
LEN(REPLACE(CONVERT(varchar(8),
CONVERT(binary(8),
CONVERT(bigint, CONVERT(binary(8), String1)) ^ CONVERT(binary(8), String2)
)
) COLLATE Latin1_General_BIN,
char(0),
'')) AS XORCharDiff
FROM (VALUES ('Example1', 'Example2')) AS X(String1, String2)
For all string lengths:
CREATE FUNCTION dbo.XORCharDiff (@string1 VARCHAR(MAX), @string2 VARCHAR(MAX))
RETURNS INT
AS
BEGIN
DECLARE @diffs INT = 0
WHILE LEN(@string1) > 0 AND LEN(@string2) > 0
BEGIN
SELECT @diffs = @diffs +
LEN(REPLACE(CONVERT(varchar(8),
CONVERT(binary(8),
CONVERT(bigint, CONVERT(binary(8), SUBSTRING(@string1, 1, 8))) ^ CONVERT(binary(8), SUBSTRING(@string2, 1, 8))
)
) COLLATE Latin1_General_BIN,
char(0),
''))
SELECT @string1 = SUBSTRING(@string1,9,8)
SELECT @string2 = SUBSTRING(@string2,9,8)
END
RETURN @diffs + LEN(@string1) + LEN(@string2)
END
The string comparison is based on a bitwise XOR of binary representations of both strings. In case of matching characters in the two strings, the characters 'cancel each other out' resulting in a 0x00/char(0)
Example:
01001000 01101001 'Hi' (String1)
01001000 01100001 'Ha' (String2)
00000000 00001000 resulting bitwise XOR
By removing the char(0), the length of string is reduced by the number of matching characters.
It can all be done as in a single line without the need of a loop and a function. Which results in a high speed solution.
Some further explanation:
The Bigint
conversion is needed since bitwise XOR does not work on 2 binaries. Because Bigint has 8 Bytes, this sets the limit to 8 characters per comparison.
COLLATE Latin1_General_BIN
is needed because otherwise unprintable characters are removed.
See below the timing comparison between this solution and CharDiff function (for 100k of 8 character long strings).
XOR CharDiff (single line max 8 characters)
SQL Server Execution Times:
CPU time = 110 ms, elapsed time = 109 ms.
(100000 rows affected)
XOR CharDiff function (all string lengths)
SQL Server Execution Times:
CPU time = 3188 ms, elapsed time = 3979 ms.
(100000 rows affected)
CharDiff
SQL Server Execution Times:
CPU time = 7828 ms, elapsed time = 10442 ms.
(100000 rows affected)
Upvotes: 0
Reputation: 2061
I needed something slightly different. I needed a comparison of '1234' to show as 2 characters different than '1243' even though they contain the same characters.
I came up with the following:
CREATE FUNCTION dbo.CharDiff (@string1 NVARCHAR(MAX), @string2 NVARCHAR(MAX))
RETURNS INT
AS
BEGIN
DECLARE @diffs INT = 0
WHILE LEN(@string1) > 0 AND LEN(@string2) > 0
BEGIN
IF SUBSTRING(@string1,1,1) <> SUBSTRING(@string2,1,1)
SELECT @diffs = @diffs + 1
SELECT @string1 = SUBSTRING(@string1,2,9999)
SELECT @string2 = SUBSTRING(@string2,2,9999)
END
RETURN @diffs + LEN(@string1) + LEN(@string2)
END
Upvotes: 4
Reputation: 4117
As this page lists all string functions available in SQL Server, I am pretty sure there is no built-in functionality for exactly this use case available.
However, with a little help from this post I came up with the following which seems to fit your needs:
CREATE FUNCTION dbo.CharDiff (@string1 NVARCHAR(MAX), @string2 NVARCHAR(MAX))
RETURNS INT
AS
BEGIN
DECLARE @allDifferences INT = 0
DECLARE @charCount1 INT
DECLARE @charCount2 INT
--do this as long as both strings are longer than 0
WHILE LEN(@string1) > 0 AND LEN(@string2) > 0
BEGIN
--get char count for the character at index 1 in string 1
SELECT @charCount1 = (LEN(@string1) - LEN(REPLACE(@string1, SUBSTRING(@string1, 1, 1), '')))
--get char count for the character at index 1 in string 1 but for string2
SELECT @charCount2 = (LEN(@string2) - LEN(REPLACE(@string2, SUBSTRING(@string1, 1, 1), '')))
--strip all chars that now have been counted from string 2
SELECT @string2 = REPLACE(@string2, SUBSTRING(@string1, 1, 1),'')
--strip all chars that now have been counted from string 1
SELECT @string1 = REPLACE(@string1, SUBSTRING(@string1, 1, 1),'')
--add difference to counting variable
SELECT @allDifferences = @allDifferences + ABS(@charCount1 - @charCount2)
END
--is there any rest length on any of those string?
SELECT @allDifferences = @allDifferences + ABS(LEN(@string1) - LEN(@string2))
RETURN @allDifferences
END
Basically I just count the occurence of the character that currently exists on index 1 in the first string in both strings.
Then I delete all those already counted characters from both strings (that way index 1 holds another character each iteration), as long as both strings have any characters in them. The rest length of a string that might still exist after that loop can simply be added to @allDifferences
.
Upvotes: 2