Reputation: 682
I have some Unicode characters in an NVarchar
field named "PostalCode". When I convert them to Varchar
, there is a ?
in the result.
My code is:
select PostalCode, cast((PostalCode) as varchar)) as val from table
and the result is:
PostalCode | val
053000 | 053000?
Here I am getting a ?
in the result. Is there any way to remove such special characters?
Upvotes: 4
Views: 22296
Reputation: 48914
There are a few things to note here:
If you want to see exactly which character is there, you can convert the value to VARBINARY
which will give you the hex / binary value of all characters in the string and there is no concept of "hidden" characters in hex:
DECLARE @PostalCode NVARCHAR(20);
SET @PostalCode = N'053000'+ NCHAR(0x2008); -- 0x2008 = "Punctuation Space"
SELECT @PostalCode AS [NVarCharValue],
CONVERT(VARCHAR(20), @PostalCode) AS [VarCharValue],
CONVERT(VARCHAR(20), RTRIM(@PostalCode)) AS [RTrimmedVarCharValue],
CONVERT(VARBINARY(20), @PostalCode) AS [VarBinaryValue];
Returns:
NVarCharValue VarCharValue RTrimmedVarCharValue VarBinaryValue
053000 053000? 053000? 0x3000350033003000300030000820
NVARCHAR
data is stored as UTF-16 which works in 2-byte sets. Looking at the last 4 hex digits to see what the hidden 2-byte set is, we see "0820". Since Windows and SQL Server are UTF-16 Little Endian (i.e. UTF-16LE), the bytes are in reverse order. Flipping the final 2 bytes -- 08
and 20
-- we get "2008", which is the "Punctuation Space" that we added via NCHAR(0x2008)
.
Also, please note that RTRIM
did not help at all here.
Simplistically, you can just replace the question marks with nothing:
SELECT REPLACE(CONVERT(VARCHAR(20), [PostalCode]), '?', '');
More importantly, you should convert the [PostalCode]
field to VARCHAR
so that it doesn't store these characters. No country uses letters that are not represented in the ASCII character set and that are not valid for the VARCHAR datatype, at least as far as I have ever read about (see bottom section for references). In fact, what is allowed is a rather small subset of ASCII, which means you can easily filter on the way in (or just do the same REPLACE
as shown above when inserting or updating):
ALTER TABLE [table] ALTER COLUMN [PostalCode] VARCHAR(20) [NOT]? NULL;
Be sure to check the current NULL
/ NOT NULL
setting for the column and make it the same in the ALTER statement above, else it could be changed as the default is NULL
if not specified.
If you cannot change the schema of the table and need to do a periodic "cleansing" of the bad data, you can run the following:
;WITH cte AS
(
SELECT *
FROM TableName
WHERE [PostalCode] <>
CONVERT(NVARCHAR(50), CONVERT(VARCHAR(50), [PostalCode]))
)
UPDATE cte
SET cte.[PostalCode] = REPLACE(CONVERT(VARCHAR(50), [PostalCode]), '?', '');
Please keep in mind that the above query is not meant to work efficiently if the table has millions of rows. At that point it would need to be handled in smaller sets via a loop.
For reference, here is the wikipedia article for Postal code, which currently states that the only characters ever used are:
- The arabic numerals "0" to "9"
- Letters of the ISO basic Latin alphabet
- Spaces, hyphens
And regarding the max size of the field, here is the Wikipedia List of postal codes
Upvotes: 5
Reputation: 9151
If you want to remove only that last special character, leaving the rest the same, you can try one of the following to remove it before you convert to varchar
:
Maybe that last special character is considered a space. Try RTRIM:
cast(rtrim(PostalCode) as varchar))
First, find the Unicode code point of that last character from an example row of your table, using RIGHT and UNICODE:
select unicode(right(PostalCode, 1)) from …
This should give you an integer number.
Next, drop that character from your rows when not needed. We can convert that integer number back into a unicode character with NCHAR:
select (case when right(PostalCode, 1) = nchar(the_number_from_above)
then left(PostalCode, len(PostalCode) - 1)
else PostalCode end)
Upvotes: 0
Reputation: 9151
As long as "?" is not allowed in a real PostalCode value, you can cast first, and then remove these characters using REPLACE
to replace them with the empty string:
replace(cast((PostalCode) as varchar))), '?', '')
These '?'
characters represent Unicode characters in the original nvarchar
value that were not converted to an equivalent ASCII varchar
character. This means that this method will silently remove any such characters. You said you wanted to simply remove these characters, but you might want to rethink that.
To give an example, if the postal code can contain letters, and someone accidentally typed a letter with an accent:
1234-ÁBCD
The end result would be:
1234-BCD
Upvotes: 1
Reputation: 35318
No, Unicode is Unicode. The standard allows for innumerable "special" characters beyond ASCII. The best you can do is search for your desired subset of characters before the conversion and convert them to your favorite ASCII stand-ins.
Upvotes: 0