Reputation: 224
In Oracle SQL:
When we pass a non-ASCII character to the ASCII function, it returns some number. How can we interpret that number? If our character set is AL32UTF8, why is it not returning the Unicode code point for the given character?
select * from nls_database_parameters where parameter = 'NLS_CHARACTERSET';--AL32UTF8
select ascii('Á') from dual;--50049
What is the meaning of this value 50049? I was expecting 193.
Upvotes: 0
Views: 4650
Reputation:
Here - mostly for my own education - is an explanation of the value 50049 for the accented A, Á (code point 193 in the Unicode coded character set). I just learned how this works from reading https://www.fileformat.info/info/unicode/utf8.htm. Another pretty clear explanation, with an example of a three-byte encoding, is on Wikipedia: https://en.wikipedia.org/wiki/UTF-8#Encoding
The computation of the encoding is derived from the code point 193 and has nothing to do with which specific character is associated with that code point.
UTF-8 uses a relatively simple scheme to encode code points up to 1,114,111 (hex 10FFFF, the upper limit of the Unicode code space).
Code points from 0 to 127 (decimal) are encoded as a single byte, equal to the code point (so the byte always has a leading zero bit when represented in binary).
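The one-byte case can be confirmed with Python's built-in encoder:

```python
# A code point below 128 is encoded as a single byte equal to itself.
assert ord('A') == 65 == 0x41
assert 'A'.encode('utf-8') == b'\x41'
```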
Code points from 128 to 2047 are encoded as two bytes. The first byte, in binary representation, always begins with 110, and the second with 10. A byte that begins with 110 is always the first byte of a two-byte encoding, and a byte that begins with 10 is always a "continuation" byte (second, third or fourth) in a multi-byte encoding. These mandatory prefixes are part of the encoding scheme (the "rules" of UTF-8 encoding); they are hard-coded values in the rules.
So: for code points from 128 to 2047, the encoding is in two bytes, of the exact format 110xxxxx and 10xxxxxx in binary notation. The last five digits (bits) from the first byte, plus the last six digits from the second (total: 11 bits) are the binary representation of the code point (the value from 128 to 2047 that must be encoded).
2047 = 2^11 - 1 (this is why 2047 is relevant here). The code point can be represented as an 11-bit binary number (possibly with leading zeros). Take the first five bits (after left-padding with 0 to a length of 11 bits) and attach them to the mandatory 110 prefix of the first byte, then take the last six bits of the code point and attach them to the mandatory 10 prefix of the second byte. That gives the UTF-8 encoding (in two bytes) of the given code point.
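A short Python sketch of this two-byte rule (the helper name utf8_two_byte is made up for illustration):

```python
def utf8_two_byte(cp):
    """Encode a code point in the range 128..2047 as two UTF-8 bytes."""
    assert 128 <= cp <= 2047
    first = 0b11000000 | (cp >> 6)         # 110 prefix + top 5 bits
    second = 0b10000000 | (cp & 0b111111)  # 10 prefix + bottom 6 bits
    return bytes([first, second])

# Cross-check against Python's built-in encoder
assert utf8_two_byte(193) == 'Á'.encode('utf-8')  # b'\xc3\x81'
```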
Let's do that for code point 193 (decimal). In binary, padded with zeros on the left, that is 00011000001. So far, nothing fancy.
Split this into five bits || six bits: 00011 and 000001.
Attach the mandatory prefixes: 11000011 and 10000001.
Rewrite these in hex: \xC3 and \x81. Put them together; this is hex C381, or decimal 50049.
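The worked example can be cross-checked against Python's built-in encoder:

```python
enc = 'Á'.encode('utf-8')                   # 'Á' is code point 193 (U+00C1)
assert enc == b'\xc3\x81'                   # the two bytes derived above
assert int.from_bytes(enc, 'big') == 50049  # hex C381 read as one number
```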
Upvotes: 2
Reputation: 59456
See documentation: ASCII
ASCII returns the decimal representation in the database character set of the first character of char.
The binary value of the character Á (U+00C1) in UTF-8 is 0xC381, which is decimal 50049.
193 is the code point. For UTF-8 the code point is equal to the binary representation only for characters U+0000 - U+007F (0-127). For UTF-16BE the code point is equal to the binary representation only for characters U+0000 - U+FFFF (0-65535).
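That distinction between code point and encoded value can be seen in Python (assuming the character in question is Á, U+00C1):

```python
c = 'Á'                                                     # U+00C1
assert ord(c) == 193                                        # the code point
assert int.from_bytes(c.encode('utf-8'), 'big') == 50049    # UTF-8 bytes: C3 81
assert int.from_bytes(c.encode('utf-16-be'), 'big') == 193  # UTF-16BE unit: 00 C1
```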
Maybe you are looking for ASCIISTR('Á'), which returns \00C1; you only have to convert it to a decimal value.
Some time ago I developed this function, which is more advanced than ASCIISTR; it also works with characters outside the BMP, which ASCIISTR reports as surrogate pairs.
CREATE OR REPLACE TYPE VARCHAR_TABLE_TYPE AS TABLE OF VARCHAR2(100);

CREATE OR REPLACE FUNCTION UNICODECHAR(uchar VARCHAR2) RETURN VARCHAR_TABLE_TYPE IS
  -- ASCIISTR converts the input to UTF-16 escape form, e.g. 'Á' -> '\00C1'
  UTF16       VARCHAR2(32000) := ASCIISTR(uchar);
  UTF16_Table VARCHAR_TABLE_TYPE := VARCHAR_TABLE_TYPE();
  sg1         VARCHAR2(4); -- high (leading) surrogate
  sg2         VARCHAR2(4); -- low (trailing) surrogate
  codepoint   INTEGER;
  res         VARCHAR_TABLE_TYPE := VARCHAR_TABLE_TYPE();
  i           INTEGER;
BEGIN
  IF uchar IS NULL THEN
    RETURN VARCHAR_TABLE_TYPE();
  END IF;
  -- Tokenize the ASCIISTR output into '\xxxx' escapes and plain characters
  SELECT REGEXP_SUBSTR(UTF16, '(\\[[:xdigit:]]{4})|.', 1, LEVEL)
    BULK COLLECT INTO UTF16_Table
    FROM dual
  CONNECT BY REGEXP_SUBSTR(UTF16, '\\[[:xdigit:]]{4}|.', 1, LEVEL) IS NOT NULL;
  i := UTF16_Table.FIRST;
  WHILE i IS NOT NULL LOOP
    res.EXTEND;
    IF REGEXP_LIKE(UTF16_Table(i), '^\\') THEN
      -- '\D8xx'..'\DBxx' is a high surrogate: combine it with the next token
      IF REGEXP_LIKE(UTF16_Table(i), '^\\D(8|9|A|B)') THEN
        sg1 := REGEXP_SUBSTR(UTF16_Table(i), '[[:xdigit:]]{4}');
        i := UTF16_Table.NEXT(i);
        sg2 := REGEXP_SUBSTR(UTF16_Table(i), '[[:xdigit:]]{4}');
        codepoint := 2**10 * (TO_NUMBER(sg1, 'XXXX') - TO_NUMBER('D800', 'XXXX'))
                     + TO_NUMBER(sg2, 'XXXX') - TO_NUMBER('DC00', 'XXXX') + 2**16;
        res(res.LAST) := 'U+'||TO_CHAR(codepoint, 'fmXXXXXX');
      ELSE
        res(res.LAST) := 'U+'||REGEXP_REPLACE(UTF16_Table(i), '^\\');
      END IF;
    ELSE
      -- A plain character in ASCIISTR output is ASCII, so ASCII() gives its code point
      res(res.LAST) := 'U+'||LPAD(TO_CHAR(ASCII(UTF16_Table(i)), 'fmXX'), 4, '0');
    END IF;
    i := UTF16_Table.NEXT(i);
  END LOOP;
  RETURN res;
END UNICODECHAR;
/
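The surrogate-pair arithmetic in the function above can be sketched in Python (decode_surrogate_pair is a made-up name mirroring the PL/SQL computation):

```python
def decode_surrogate_pair(hi, lo):
    # Same formula as the PL/SQL: 2**10*(hi - D800) + (lo - DC00) + 2**16
    return (hi - 0xD800) * 2**10 + (lo - 0xDC00) + 2**16

# U+1F3F4 (waving black flag) is the pair D83C DFF4 in UTF-16
assert decode_surrogate_pair(0xD83C, 0xDFF4) == 0x1F3F4
```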
Try some examples from https://unicode.org/emoji/charts/full-emoji-list.html#1f3f3_fe0f_200d_1f308
UNICODECHAR('🏴‍☠️')
should return
U+1F3F4
U+200D
U+2620
U+FE0F
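The expected code points for that emoji (a ZWJ sequence) can be cross-checked in Python:

```python
flag = '\U0001F3F4\u200D\u2620\uFE0F'  # pirate-flag ZWJ sequence
assert [f'U+{ord(ch):04X}' for ch in flag] == ['U+1F3F4', 'U+200D', 'U+2620', 'U+FE0F']
```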
Upvotes: 1