Reputation: 224

ascii function in oracle sql, for non-ascii value

In Oracle SQL:

When we pass a non-ASCII character to the ASCII function, it returns some number. How can we interpret that number? If our character set is AL32UTF8, why does it not return the Unicode code point of the given character?

select * from nls_database_parameters where parameter = 'NLS_CHARACTERSET';--AL32UTF8

select ascii('Á') from dual;--50049

What is the meaning of the value 50049? I was expecting 193.

Upvotes: 0

Views: 4650

Answers (2)

user5683823

Reputation:

Here - mostly for my own education - is an explanation of the value 50049 for the accented A, Á (code point 193 in the Unicode coded character set). I just learned how this works from reading https://www.fileformat.info/info/unicode/utf8.htm Another pretty clear explanation, with an example of a three-byte encoding, is on Wikipedia: https://en.wikipedia.org/wiki/UTF-8#Encoding

The computation of the encoding is derived from the code point 193 and has nothing to do with which specific character is associated with that code point.

UTF-8 uses a relatively simple scheme to encode code points up to 1,114,111 (hex 10FFFF, the upper limit of the Unicode code space; for now let's only worry about code points up to that limit).

Code points from 1 to 127 (decimal) are encoded as a single byte, equal to the code point (so the byte always has a leading zero when represented in binary).

Code points from 128 to 2047 are encoded as two bytes. The first byte, in binary representation, always begins with 110, and the second with 10. A byte that begins with 110 is always the first byte of a two-byte encoding, and a byte that begins with 10 is always a "continuation" byte (second, third or fourth) in a multi-byte encoding. These mandatory prefixes are hard-coded in the "rules" of the UTF-8 encoding scheme.

So: for code points from 128 to 2047, the encoding is in two bytes, of the exact format 110xxxxx and 10xxxxxx in binary notation. The last five digits (bits) from the first byte, plus the last six digits from the second (total: 11 bits) are the binary representation of the code point (the value from 128 to 2047 that must be encoded).

2047 = 2^11 - 1 (this is why 2047 is relevant here). The code point can be represented as an 11-bit binary number (possibly with leading zeros). Take the first five bits (after left-padding with 0 to a length of 11 bits) and attach them to the mandatory 110 prefix of the first byte, then take the last six bits of the code point and attach them to the mandatory prefix 10 of the second byte. That gives the UTF-8 encoding (in two bytes) of the given code point.

Let's do that for code point 193 (decimal). In binary, padded with 0 to the left to 11 bits, that is 00011000001. So far, nothing fancy.

Split this into five bits || six bits: 00011 and 000001.

Attach the mandatory prefixes: 11000011 and 10000001.

Rewrite these in hex: \xC3 and \x81. Put them together; this is hex C381, or decimal 50049.
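The same computation can be checked outside the database. A minimal Python sketch (my addition, not part of the original answer) that rebuilds the two bytes exactly as described above:

```python
# Re-doing the two-byte UTF-8 encoding of code point 193 by hand
cp = 193                               # code point of 'Á' (U+00C1)
bits = format(cp, '011b')              # '00011000001' - 11 bits, left-padded
byte1 = int('110' + bits[:5], 2)       # 110xxxxx -> 0b11000011 = 0xC3
byte2 = int('10'  + bits[5:], 2)       # 10xxxxxx -> 0b10000001 = 0x81

# Sanity check against Python's own UTF-8 encoder
assert bytes([byte1, byte2]) == 'Á'.encode('utf-8')

print(byte1 * 256 + byte2)             # the value Oracle's ASCII() returned
```

Treating the two bytes as one big-endian number gives 0xC381 = 50049, matching the query result in the question.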

Upvotes: 2

Wernfried Domscheit

Reputation: 59456

See documentation: ASCII

ASCII returns the decimal representation in the database character set of the first character of char.

The binary value of the character Á (U+00C1) in UTF-8 is 0xC381, which is decimal 50049.

193 is the code point. For UTF-8, the code point is equal to the binary representation only for characters U+0000 - U+007F (0-127). For UTF-16BE, the code point is equal to the binary representation only for characters U+0000 - U+FFFF (0-65535).
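This contrast between the two encodings can be illustrated outside the database; a small Python sketch (my addition, using Python's codecs to stand in for the database character set):

```python
# Compare the numeric value of U+00C1's bytes in UTF-8 vs UTF-16BE
ch = 'Á'                                              # U+00C1, code point 193
utf8  = int.from_bytes(ch.encode('utf-8'), 'big')     # bytes C3 81
utf16 = int.from_bytes(ch.encode('utf-16-be'), 'big') # bytes 00 C1

print(utf8, utf16)   # UTF-8 value differs from the code point; UTF-16BE matches
```

In UTF-16BE the two bytes 00 C1 read as 193, the code point itself; in UTF-8 the bytes C3 81 read as 50049, which is what Oracle's ASCII returns in an AL32UTF8 database.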

Maybe you are looking for

ASCIISTR('Á')

which returns \00C1; you only have to convert it to a decimal value.
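That last conversion step is simple hex-to-decimal arithmetic; a quick Python sketch (my addition; `asciistr_result` is a hard-coded stand-in for the string ASCIISTR returns for U+00C1):

```python
# Convert ASCIISTR's \XXXX escape into a decimal code point
asciistr_result = '\\00C1'                     # what ASCIISTR gives for U+00C1
codepoint = int(asciistr_result.lstrip('\\'), 16)
print(codepoint)                               # the decimal code point
```

In SQL, a conversion along the same lines could presumably be done with TO_NUMBER and the 'XXXX' format mask after stripping the leading backslash.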

Some time ago I developed this function, which is more advanced than ASCIISTR; it also works with multi-code-point characters.

CREATE OR REPLACE TYPE VARCHAR_TABLE_TYPE AS TABLE OF VARCHAR2(100);


CREATE OR REPLACE FUNCTION UNICODECHAR(uchar VARCHAR2) RETURN VARCHAR_TABLE_TYPE IS

    UTF16 VARCHAR2(32000) := ASCIISTR(uchar);
    UTF16_Table VARCHAR_TABLE_TYPE := VARCHAR_TABLE_TYPE();
    sg1 VARCHAR2(4);
    sg2 VARCHAR2(4);
    codepoint INTEGER;

    res VARCHAR_TABLE_TYPE := VARCHAR_TABLE_TYPE();
    i INTEGER;
BEGIN
 
    IF uchar IS NULL THEN
        RETURN VARCHAR_TABLE_TYPE();
    END IF;
    
    SELECT REGEXP_SUBSTR(UTF16, '(\\[[:xdigit:]]{4})|.', 1, LEVEL)
    BULK COLLECT INTO UTF16_Table  
    FROM dual
    CONNECT BY REGEXP_SUBSTR(UTF16, '\\[[:xdigit:]]{4}|.', 1, LEVEL) IS NOT NULL;
    
    i := UTF16_Table.FIRST;
    WHILE i IS NOT NULL LOOP
        res.EXTEND;
        IF REGEXP_LIKE(UTF16_Table(i), '^\\') THEN
            IF REGEXP_LIKE(UTF16_Table(i), '^\\D(8|9|A|B)') THEN
                sg1 := REGEXP_SUBSTR(UTF16_Table(i), '[[:xdigit:]]{4}');
                i := UTF16_Table.NEXT(i);
                sg2 := REGEXP_SUBSTR(UTF16_Table(i), '[[:xdigit:]]{4}');
                codepoint := 2**10 * (TO_NUMBER(sg1, 'XXXX') - TO_NUMBER('D800', 'XXXX')) + TO_NUMBER(sg2, 'XXXX') - TO_NUMBER('DC00', 'XXXX') + 2**16;
                res(res.LAST) := 'U+'||TO_CHAR(codepoint, 'fmXXXXXX');
            ELSE
                res(res.LAST) := 'U+'||REGEXP_REPLACE(UTF16_Table(i), '^\\');
            END IF; 
        ELSE
            res(res.LAST) := 'U+'||LPAD(TO_CHAR(ASCII(UTF16_Table(i)), 'fmXX'), 4, '0');
        END IF;     
        i := UTF16_Table.NEXT(i);
    END LOOP;
    RETURN res;

END UNICODECHAR;
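The surrogate-pair arithmetic in the `codepoint :=` line above can be checked independently; a Python sketch (my addition, with the sample pair D83C/DFF4 hard-coded; the D800/DC00 constants and the 2^10 and 2^16 terms mirror the PL/SQL):

```python
# Decode a UTF-16 surrogate pair into a Unicode code point
sg1, sg2 = 0xD83C, 0xDFF4            # surrogate pair for the waving black flag
codepoint = 2**10 * (sg1 - 0xD800) + (sg2 - 0xDC00) + 2**16

print(f'U+{codepoint:X}')            # code point outside the BMP
```

The high surrogate contributes its offset from D800 shifted by ten bits, the low surrogate its offset from DC00, and adding 2^16 moves the result past the Basic Multilingual Plane, giving U+1F3F4.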

Try some examples from https://unicode.org/emoji/charts/full-emoji-list.html#1f3f3_fe0f_200d_1f308

UNICODECHAR('🏴‍☠️')

should return

U+1F3F4 
U+200D 
U+2620 
U+FE0F

Upvotes: 1
