Andres Urrego Angel
Andres Urrego Angel

Reputation: 1932

Issues reading a string with weird characters in SQL Server 2008 and previous versions

Here is my issue today: we load some info from text files into a SQL Server table. The thing is that for a field I realized that the string value is a little bit weird.

When I query the table with a SQL Server client 2008 or previous, I got this result set:

enter image description here

Even if I try to run a query like this one below, the resultset is empty:

select REPLACE(LTRIM(RTRIM(cust_po)), '  ', ' ') 
from dbo.test_char
where cust_po like '%076929%'

The weird thing is: with SQL Server 2016, just opening the table looks normal, although the query doesn't return any results either:

enter image description here

Now, following a couple of instruction from @SolomonRutzky I have performed the query below to get a result in varbinary

enter image description here

the new result is : 0x300037003600390032003900BC05BC05BC05BC05BC05BC05BC05BC05BC05

how could get that as a varchar without the weird chars or space or wherever that is?

Upvotes: 1

Views: 3294

Answers (2)

Solomon Rutzky
Solomon Rutzky

Reputation: 48884

What is going on?

Now that the VARBINARY representation of the data has been posted, the issue is much clearer.

As we can see, the two lines with the "odd" characters are:

0x300037003600390032003900BC05BC05BC05BC05BC05BC05BC05BC05BC05

This column is NVARCHAR which means that the encoding is UTF-16 Little Endian. Being UTF-16 means that we look at each block of 2 bytes (characters will either be 1 or 2 sets of those 2-byte blocks), and being Little Endian means that the bytes in each 2-byte block are in reverse order. Meaning, the first character is 3000 (which is 2 bytes), and that equates to Code Point U+0030, which is the number 0. The next character is 3700 which is Code Point U+0037 which is the number 7. And so on down through 3900 which is the final 9 in the value.

But then we get to nine sets of BC05 which is Code Point U+05BC, which is Hebrew Point Dagesh or Mapiq. And this is where it gets interesting as there are three separate things going on here:

  1. What is displayed is a function of the font being used. And not all fonts handle all characters correctly, or even have mappings for all characters. This is why you see a series of circles to the left in the Results Grid, and a single dot on the far left in the row editor (hard to see, especially if you don't know to look for it, but look closely in the row editor image and the top two lines both have a tiny dot to the left of the 0).
  2. Hebrew is a Right-to-Left language, which explains why the circles in the Results Grid are on the left, even though the characters in the data come after the numbers.
  3. This particular character is a combining character, which means that it is supposed to attach to the character that precedes it. Since different fonts display characters differently you get the variation that you are seeing between the Results Grid and row editor. The Results Grid font does not seem to be handling this character correctly, while the row editor does. But, in handling it correctly, we see that this combining character does not allow for showing multiples, so all we see is a single dot on the far left instead of 9 dots (but all 9 are there, just one on top of the other.

To see this in action, the following query:

SELECT N'4   11' + REPLICATE(NCHAR(0x05BC), 10) + N'  88  f ';

returns:

4 11ּּּּּּּּּּ 88 f

Notice the single dot on the left side of the second 1. There is only a single dot even though the string has 10 of them (due to the REPLICATE). Well, that is what my browser is showing, captured as the following image:

enter image description here

Yet I see the following in my Results Grid:

enter image description here

Notice how the dot is to the left of the first, not second, 1, AND the placement of the 11 between the 88 and the f,

And, if I copy and paste from the Results Grid into the Query Editor, I see:

enter image description here

Notice that there are several red dots.

Why doesn't where cust_po like '%076929%' return any rows?

This is due to the string comparison doing what it should do and applying linguistic rules. It doesn't matter what order the bytes are in, what matters is how that rendered string is viewed from a human reading it perspective. And since this particular character is a combining character, it does not come after the preceding character, it is part of it. Meaning, the second 9 in your 076929 value isn't a 9 anymore, it is 9 + Dagesh.

SELECT 1 WHERE
N'123' + REPLICATE(NCHAR(0x05BC), 5) LIKE N'%123' + REPLICATE(NCHAR(0x05BC), 4) + N'%';
-- no rows returned

SELECT 2 WHERE
N'123' + REPLICATE(NCHAR(0x05BC), 5) LIKE N'%123' + REPLICATE(NCHAR(0x05BC), 5) + N'%';
-- 2

SELECT 3 WHERE N'9' + NCHAR(0x05BC) = N'9'
-- no rows returned

How do you get rid of those characters?

Assuming that this U+05BC character is the only issue in your data, you can do a simple replace. You have already tried this, but REPLACE isn't finding a match. When this happens, you need to use a binary Collation as shown here:

SELECT REPLACE(N'4   11' + REPLICATE(NCHAR(0x05BC), 5) + N'  88  f ',
               NCHAR(0x05BC) COLLATE Latin1_General_100_BIN2,
               N'~');
-- 4   11~~~~~  88  f 

Upvotes: 3

Alan Burstein
Alan Burstein

Reputation: 7928

First - you're in good hands when you have @solomon Rutzky looking at your code ;)

I don't have an NVarchar version of this at the moment but, if you only want the numeric values in your column you could use DigitsOnlyEE. The solution would look something like this.

-- sample data with a couple characters to screw you up. 
DECLARE @table TABLE (somestring nvarchar(100));
INSERT @table
SELECT CONCAT(NCHAR(365), '123') UNION ALL
SELECT CONCAT(CHAR(0), '789');

-- solution
SELECT *
FROM @table t
CROSS APPLY dbo.DigitsOnlyEE(t.somestring);

Results:

somestring   DigitsOnly
------------ ------------
ŭ123         123
 789         789

Upvotes: 1

Related Questions