Reputation: 1932
Here is my issue today: we load some info from text files into a SQL Server table. The thing is that for a field I realized that the string value is a little bit weird.
When I query the table with a SQL Server client 2008 or previous, I got this result set:
Even if I try to run a query like this one below, the resultset is empty:
select REPLACE(LTRIM(RTRIM(cust_po)), ' ', ' ')
from dbo.test_char
where cust_po like '%076929%'
The weird thing is: with SQL Server 2016, just opening the table looks normal, although the query doesn't return any results either:
Now, following a couple of instruction from @SolomonRutzky I have performed the query below to get a result in varbinary
the new result is : 0x300037003600390032003900BC05BC05BC05BC05BC05BC05BC05BC05BC05
how could get that as a varchar without the weird chars or space or wherever that is?
Upvotes: 1
Views: 3294
Reputation: 48884
Now that the VARBINARY
representation of the data has been posted, the issue is much clearer.
As we can see, the two lines with the "odd" characters are:
0x300037003600390032003900BC05BC05BC05BC05BC05BC05BC05BC05BC05
This column is NVARCHAR
which means that the encoding is UTF-16 Little Endian. Being UTF-16 means that we look at each block of 2 bytes (characters will either be 1 or 2 sets of those 2-byte blocks), and being Little Endian means that the bytes in each 2-byte block are in reverse order. Meaning, the first character is 3000
(which is 2 bytes), and that equates to Code Point U+0030, which is the number 0
. The next character is 3700
which is Code Point U+0037 which is the number 7
. And so on down through 3900
which is the final 9
in the value.
But then we get to nine sets of BC05
which is Code Point U+05BC, which is Hebrew Point Dagesh or Mapiq. And this is where it gets interesting as there are three separate things going on here:
0
).To see this in action, the following query:
SELECT N'4 11' + REPLICATE(NCHAR(0x05BC), 10) + N' 88 f ';
returns:
Notice the single dot on the left side of the second 1
. There is only a single dot even though the string has 10 of them (due to the REPLICATE
). Well, that is what my browser is showing, captured as the following image:
Yet I see the following in my Results Grid:
Notice how the dot is to the left of the first, not second, 1
, AND the placement of the 11
between the 88
and the f
,
And, if I copy and paste from the Results Grid into the Query Editor, I see:
Notice that there are several red dots.
where cust_po like '%076929%'
return any rows?This is due to the string comparison doing what it should do and applying linguistic rules. It doesn't matter what order the bytes are in, what matters is how that rendered string is viewed from a human reading it perspective. And since this particular character is a combining character, it does not come after the preceding character, it is part of it. Meaning, the second 9
in your 076929
value isn't a 9
anymore, it is 9
+ Dagesh.
SELECT 1 WHERE
N'123' + REPLICATE(NCHAR(0x05BC), 5) LIKE N'%123' + REPLICATE(NCHAR(0x05BC), 4) + N'%';
-- no rows returned
SELECT 2 WHERE
N'123' + REPLICATE(NCHAR(0x05BC), 5) LIKE N'%123' + REPLICATE(NCHAR(0x05BC), 5) + N'%';
-- 2
SELECT 3 WHERE N'9' + NCHAR(0x05BC) = N'9'
-- no rows returned
Assuming that this U+05BC character is the only issue in your data, you can do a simple replace. You have already tried this, but REPLACE
isn't finding a match. When this happens, you need to use a binary Collation as shown here:
SELECT REPLACE(N'4 11' + REPLICATE(NCHAR(0x05BC), 5) + N' 88 f ',
NCHAR(0x05BC) COLLATE Latin1_General_100_BIN2,
N'~');
-- 4 11~~~~~ 88 f
Upvotes: 3
Reputation: 7928
First - you're in good hands when you have @solomon Rutzky looking at your code ;)
I don't have an NVarchar version of this at the moment but, if you only want the numeric values in your column you could use DigitsOnlyEE. The solution would look something like this.
-- sample data with a couple characters to screw you up.
DECLARE @table TABLE (somestring nvarchar(100));
INSERT @table
SELECT CONCAT(NCHAR(365), '123') UNION ALL
SELECT CONCAT(CHAR(0), '789');
-- solution
SELECT *
FROM @table t
CROSS APPLY dbo.DigitsOnlyEE(t.somestring);
Results:
somestring DigitsOnly
------------ ------------
ŭ123 123
789 789
Upvotes: 1