Reputation: 224
In Oracle SQL:
When we pass a non-ASCII character to the ASCII function, it returns some number. How can we interpret that number? If our character set is AL32UTF8, why is it not returning the Unicode code point for the given character?
select * from nls_database_parameters where parameter = 'NLS_CHARACTERSET';--AL32UTF8
select ascii('Á') from dual;--50049
What is the meaning of this value 50049? I was expecting 193.
Upvotes: 0
Views: 4650
Reputation:
Here - mostly for my own education - is an explanation of the value 50049 for the accented A, Á (code point 193 in the Unicode coded character set). I just learned how this works from reading https://www.fileformat.info/info/unicode/utf8.htm. Another pretty clear explanation, with an example of a three-byte encoding, is on Wikipedia: https://en.wikipedia.org/wiki/UTF-8#Encoding
The computation of the encoding is derived from the code point 193 and has nothing to do with which specific character is associated with that code point.
UTF-8 uses a relatively simple scheme to encode code points up to 1,114,111 (hex 10FFFF, the upper limit of the Unicode code space).
Code points from 0 to 127 (decimal) are encoded as a single byte, equal to the code point (so the byte always has a leading zero bit when represented in binary).
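The one-byte case can be confirmed with Python's built-in encoder:

```python
# A code point below 128 is encoded as a single byte equal to itself.
assert ord('A') == 65 == 0x41
assert 'A'.encode('utf-8') == b'\x41'
```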
Code points from 128 to 2047 are encoded as two bytes. The first byte, in binary representation, always begins with 110, and the second with 10. A byte that begins with 110 is always the first byte of a two-byte encoding, and a byte that begins with 10 is always a "continuation" byte (second, third or fourth) in a multi-byte encoding. These mandatory prefixes are part of the encoding scheme (the "rules" of UTF-8 encoding); they are hard-coded values in the rules.
So: for code points from 128 to 2047, the encoding is in two bytes, of the exact format 110xxxxx and 10xxxxxx in binary notation. The last five digits (bits) from the first byte, plus the last six digits from the second (total: 11 bits) are the binary representation of the code point (the value from 128 to 2047 that must be encoded).
2047 = 2^11 - 1 (this is why 2047 is relevant here). The code point can be represented as an 11-bit binary number (possibly with leading zeros). Take the first five bits (after left-padding with 0 to a length of 11 bits) and attach them to the mandatory 110 prefix of the first byte, then take the last six bits of the code point and attach them to the mandatory 10 prefix of the second byte. That gives the UTF-8 encoding (in two bytes) of the given code point.
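A short Python sketch of this two-byte rule (the helper name utf8_two_byte is made up for illustration):

```python
def utf8_two_byte(cp):
    """Encode a code point in the range 128..2047 as two UTF-8 bytes."""
    assert 128 <= cp <= 2047
    first = 0b11000000 | (cp >> 6)         # 110 prefix + top 5 bits
    second = 0b10000000 | (cp & 0b111111)  # 10 prefix + bottom 6 bits
    return bytes([first, second])

# Cross-check against Python's built-in encoder
assert utf8_two_byte(193) == 'Á'.encode('utf-8')  # b'\xc3\x81'
```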
Let's do that for code point 193 (decimal). In binary, padded with zeros on the left, that is 00011000001. So far, nothing fancy.
Split this into five bits || six bits: 00011 and 000001.
Attach the mandatory prefixes: 11000011 and 10000001.
Rewrite these in hex: \xC3 and \x81. Put them together; this is hex C381, or decimal 50049.
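The worked example can be cross-checked against Python's built-in encoder:

```python
enc = 'Á'.encode('utf-8')                   # 'Á' is code point 193 (U+00C1)
assert enc == b'\xc3\x81'                   # the two bytes derived above
assert int.from_bytes(enc, 'big') == 50049  # hex C381 read as one number
```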
Upvotes: 2
Reputation: 59456
See documentation: ASCII
ASCII returns the decimal representation in the database character set of the first character of char.
The binary value of the character Á (U+00C1) in UTF-8 is 0xC381, which is decimal 50049.
193 is the code point. For UTF-8 the code point is equal to the binary representation only for characters U+0000 - U+007F (0-127). For UTF-16BE the code point is equal to the binary representation only for characters U+0000 - U+FFFF (0-65535).
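That distinction between code point and encoded value can be seen in Python (assuming the character in question is Á, U+00C1):

```python
c = 'Á'                                                     # U+00C1
assert ord(c) == 193                                        # the code point
assert int.from_bytes(c.encode('utf-8'), 'big') == 50049    # UTF-8 bytes: C3 81
assert int.from_bytes(c.encode('utf-16-be'), 'big') == 193  # UTF-16BE unit: 00 C1
```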
Maybe you are looking for ASCIISTR('Á'), which returns \00C1; you only have to convert it to a decimal value.
Some time ago I developed this function, which is more advanced than ASCIISTR; it also works with characters outside the BMP, which ASCIISTR reports as surrogate pairs.
CREATE OR REPLACE TYPE VARCHAR_TABLE_TYPE AS TABLE OF VARCHAR2(100);

CREATE OR REPLACE FUNCTION UNICODECHAR(uchar VARCHAR2) RETURN VARCHAR_TABLE_TYPE IS
  -- ASCIISTR converts the input to UTF-16 escape form, e.g. 'Á' -> '\00C1'
  UTF16       VARCHAR2(32000) := ASCIISTR(uchar);
  UTF16_Table VARCHAR_TABLE_TYPE := VARCHAR_TABLE_TYPE();
  sg1         VARCHAR2(4); -- high (leading) surrogate
  sg2         VARCHAR2(4); -- low (trailing) surrogate
  codepoint   INTEGER;
  res         VARCHAR_TABLE_TYPE := VARCHAR_TABLE_TYPE();
  i           INTEGER;
BEGIN
  IF uchar IS NULL THEN
    RETURN VARCHAR_TABLE_TYPE();
  END IF;
  -- Tokenize the ASCIISTR output into '\xxxx' escapes and plain characters
  SELECT REGEXP_SUBSTR(UTF16, '(\\[[:xdigit:]]{4})|.', 1, LEVEL)
    BULK COLLECT INTO UTF16_Table
    FROM dual
  CONNECT BY REGEXP_SUBSTR(UTF16, '\\[[:xdigit:]]{4}|.', 1, LEVEL) IS NOT NULL;
  i := UTF16_Table.FIRST;
  WHILE i IS NOT NULL LOOP
    res.EXTEND;
    IF REGEXP_LIKE(UTF16_Table(i), '^\\') THEN
      -- '\D8xx'..'\DBxx' is a high surrogate: combine it with the next token
      IF REGEXP_LIKE(UTF16_Table(i), '^\\D(8|9|A|B)') THEN
        sg1 := REGEXP_SUBSTR(UTF16_Table(i), '[[:xdigit:]]{4}');
        i := UTF16_Table.NEXT(i);
        sg2 := REGEXP_SUBSTR(UTF16_Table(i), '[[:xdigit:]]{4}');
        codepoint := 2**10 * (TO_NUMBER(sg1, 'XXXX') - TO_NUMBER('D800', 'XXXX'))
                     + TO_NUMBER(sg2, 'XXXX') - TO_NUMBER('DC00', 'XXXX') + 2**16;
        res(res.LAST) := 'U+'||TO_CHAR(codepoint, 'fmXXXXXX');
      ELSE
        res(res.LAST) := 'U+'||REGEXP_REPLACE(UTF16_Table(i), '^\\');
      END IF;
    ELSE
      -- A plain character in ASCIISTR output is ASCII, so ASCII() gives its code point
      res(res.LAST) := 'U+'||LPAD(TO_CHAR(ASCII(UTF16_Table(i)), 'fmXX'), 4, '0');
    END IF;
    i := UTF16_Table.NEXT(i);
  END LOOP;
  RETURN res;
END UNICODECHAR;
/
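The surrogate-pair arithmetic in the function above can be sketched in Python (decode_surrogate_pair is a made-up name mirroring the PL/SQL computation):

```python
def decode_surrogate_pair(hi, lo):
    # Same formula as the PL/SQL: 2**10*(hi - D800) + (lo - DC00) + 2**16
    return (hi - 0xD800) * 2**10 + (lo - 0xDC00) + 2**16

# U+1F3F4 (waving black flag) is the pair D83C DFF4 in UTF-16
assert decode_surrogate_pair(0xD83C, 0xDFF4) == 0x1F3F4
```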
Try some examples from https://unicode.org/emoji/charts/full-emoji-list.html#1f3f3_fe0f_200d_1f308
UNICODECHAR('🏴‍☠️')
should return
U+1F3F4
U+200D
U+2620
U+FE0F
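The expected code points for that emoji (a ZWJ sequence) can be cross-checked in Python:

```python
flag = '\U0001F3F4\u200D\u2620\uFE0F'  # pirate-flag ZWJ sequence
assert [f'U+{ord(ch):04X}' for ch in flag] == ['U+1F3F4', 'U+200D', 'U+2620', 'U+FE0F']
```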
Upvotes: 1