Wolfgang

Reputation: 91

UTF-8 encoding for characters bigger than the UTF-8 upper range

I'm working on a translation of UTF-8 encoding code from C# into C. The original UTF-8 design covers the range of character values from 0x0000 to 0x7FFFFFFF (http://en.wikipedia.org/wiki/UTF-8).

The encoding function in the C# file encodes, for example, the character 'ñ' without a problem.

In my sample program, the character 'ñ' has the hex value 0xFFFFFFF1 when I look at it in the memory window of VS 2005. But in the Windows character table, 'ñ' has the hex value 0xF1.

Now, in my sample program, I check each character of the string and find the highest value, to determine which UTF-8 encoding range should be used for encoding.

Like this:

Here "charToAnalyse" is one character of the string:
{
    char utfMode = 0;
    char utf8EncoderMode = 0;

    if (charToAnalyse >= 0x0000 && charToAnalyse <= 0x007F)
    { utfMode = 1; }
    else if (charToAnalyse >= 0x0080 && charToAnalyse <= 0x07FF)
    { utfMode = 2; }
    else if (charToAnalyse >= 0x0800 && charToAnalyse <= 0xFFFF)
    { utfMode = 3; }
    else if (charToAnalyse >= 0x10000 && charToAnalyse <= 0x1FFFFF)
    { utfMode = 4; }
    else if (charToAnalyse >= 0x200000 && charToAnalyse <= 0x3FFFFFF)
    { utfMode = 5; }
    else if (charToAnalyse >= 0x4000000 && charToAnalyse <= 0x7FFFFFFF)
    { utfMode = 6; }

    ...

    if (utfMode > utf8EncoderMode)
    {
        utf8EncoderMode = utfMode;
    }

In this function, utfMode stays 0 for the character 'ñ', because 'ñ' == 0xFFFFFFF1 and cannot be classified by the code above.

MY QUESTION HERE IS: 1) Is it true that 'ñ' has the value 0xFFFFFFF1? If yes, how can it be classified for UTF-8 encoding? Is it possible for a character to have a value bigger than U+7FFFFFFF (0x7FFFFFFF)? 2) Is this somehow related to the terms "low surrogate" and "high surrogate"?

Thanks a lot, even if it's an absurd question :)

Upvotes: 0

Views: 1001

Answers (2)

Wolfgang

Reputation: 91

I would have liked to explain this issue, but Joni was first :)

@Joni : You are perfectly right.

As I initialize the integer array as:

int charToAnalyseStr[50] = {'a', 0x7FFFFFFF, 'ñ', 'ş', 1};

the initialization of, e.g., the third member 'ñ' occurs as follows:

  1. The member given as 'ñ' is understood by the system as a signed char (1 byte).

  2. 'ñ' has the value -15 as a signed char; this equals 241 as an unsigned char!

  3. So the value -15 is stored as the array element during initialization.

  4. The value -15, widened to a signed integer, is represented in two's complement as 0xFFFFFFF1 (hex).

The solution I found is:

int charToAnalyseStr[50] = {(unsigned char)'a', 0x7FFFFFFF, (unsigned char)'ñ', 1};

So charToAnalyseStr[2] appears in the memory window as 0x000000F1 :)

Thanks for the brainstorming!

Upvotes: 0

hexwab

Reputation: 1841

It sounds very much as though you're reading signed bytes (is your input in ISO 8859-1, perchance?): your bytes are being interpreted as being in the range -128..127 rather than 0..255, so a value that should be 0xF1 (241) is being read as -15 instead, which is 0xFFFFFFF1 in two's complement. In C, "char" is often signed by default[1]; you should be using "unsigned char".

Unicode does not go as far up as 0xfffffff1, which is why UTF-8 does not provide an encoding for such code points.

[1] To be precise, "char" is distinct from both "signed char" and "unsigned char". But it can behave as either unsigned or signed, and which you get is implementation-defined.

Upvotes: 1
