Reputation: 1155
I came across a character, with ascii value 63, in an Excel workbook. It renders as a solid diamond containing a question mark (eg. black diamond with white question mark - �).
If I paste this into its own cell and use the Excel CODE
function, the ascii value is returned as 63. If I type a question mark into another cell, using the keyboard, and use the CODE
function, that also returns an ascii value of 63 - but the question mark renders "normally" - ?.
If I check the font on both those cells, they are using the same font (Calibri).
If I copy/paste the two characters (diamond question mark; normal question mark) into a SQL Server Management Studio query window, they continue to render differently, yet the ASCII
T-SQL function shows they both have the value 63.
If I copy/paste the two characters into the EditPlus text editor, version 2, the diamond question mark now renders as whitespace (nothing visible but a space).
The question
What is going on here? What is actually different about these two characters if it is not their character code?
Is it the character set being used per character? How can I determine the character set being used for each character? When I try and read about character sets in SQL Server it seems they apply per db/schema/table - then what is SSMS doing in order to process those two characters individually? There must be something inherent in the individual characters to distinguish them - not a table-level setting. (What does SSMS use by default for character encoding anyway, in a query window? Even if SSMS has a default character set for its query window, clearly these two characters are rendered differently). Likewise, when I read about Excel character encodings, it appears to be defined per document/file - so how is Excel rendering the same ascii character differently?
This is similar to a question on Stack Overflow where someone wanted to get SSMS to find and replace this mystery character, so I know from that question, Stack Overflow also renders the two characters differently - meaning 1. there's another application that renders them differently at a character level (web browser, Firefox in my case), and 2. you should be able to see the characters in the SQL code below:
declare @mysteryChar char(1) = '�', @knownChar char(1) = '?';
select 'mystery character', @mysteryChar union all
select 'known character', @knownChar union all
select 'mystery ascii value', cast(ascii(@mysteryChar) as varchar(16)) union all
select 'known ascii value', cast(ascii(@knownChar) as varchar(16)) union all
select 'matches knownchar', case when @mysteryChar = @knownChar then 'true' else 'false' end
If you can run T-SQL then great, the code will demonstrate the ascii values are the same, and the last line show SQL Server considers them equivalent characters, but even if you can't, the first line should show you how they render differently in your web browser.
PS. I can't find any way to output this diamond character, for example, using the CHAR
function in T-SQL. As mentioned, I found it in an Excel workbook and have only been able to copy/paste it into other applications to try and work out what's going on.
Upvotes: 1
Views: 303
Reputation: 46241
Characters "�" and "?" are different characters with different code points. To see the glyphs and code points with a T-SQL query:
SELECT
N'�' AS DiamondQuestionMark
, UNICODE(N'�') AS DiamondQuestionMarkUnicodeCodePoint
, N'?' QuestionMark
, UNICODE(N'?') AS QuestionMarkUnicodeCodePoint;
+---------------------+-------------------------------------+--------------+------------------------------+
| DiamondQuestionMark | DiamondQuestionMarkUnicodeCodePoint | QuestionMark | QuestionMarkUnicodeCodePoint |
+---------------------+-------------------------------------+--------------+------------------------------+
| � | 65533 | ? | 63 |
+---------------------+-------------------------------------+--------------+------------------------------+
ASCII only defines characters in the 0-127 code point range. With a non-Unicode literal specification of a non-ASCII character like '�'
, SQL Server maps the character to a code point in the 128-255 point range according the default database collation code page. When no mapping for the character exists in the code page, the character is either assigned a fallback character (e.g. "Ÿ" becomes "Y" when converted from code page 1252 to 850) or a "?" when no fallback character exists as in this case.
So, because "�" is not an ASCII character and does not exist in the default database code page, the single-byte literal '�'
becomes '?'
. Using Unicode literals N'�'
and Unicode data types nchar
and nvarchar
avoids this code page nastiness.
Upvotes: 2