Wolfgang

Reputation: 91

UTF-8 encoding for characters bigger than the UTF-8 upper range

I'm working on a translation of UTF-8 encoding code from C# into C. The original UTF-8 design covers the range of character values from 0x0000 to 0x7FFFFFFF (http://en.wikipedia.org/wiki/UTF-8).

The encoding function in the C# file encodes, for example, the character 'ñ' without a problem.

In my sample program, the character 'ñ' has the hex value 0xFFFFFFF1 when I look at it in the memory window of VS 2005. But in the Windows character table, 'ñ' has the hex value 0xF1.

Now, in my sample program, I check each character of the string and find the highest value, to determine which UTF-8 encoding range should be used for encoding.

Like this:

Here "charToAnalyse" is one character of the string:
{
    char utfMode = 0;
    char utf8EncoderMode = 0;

    if (charToAnalyse >= 0x0000 && charToAnalyse <= 0x007F)
    { utfMode = 1; }
    else if (charToAnalyse >= 0x0080 && charToAnalyse <= 0x07FF)
    { utfMode = 2; }
    else if (charToAnalyse >= 0x0800 && charToAnalyse <= 0xFFFF)
    { utfMode = 3; }
    else if (charToAnalyse >= 0x10000 && charToAnalyse <= 0x1FFFFF)
    { utfMode = 4; }
    else if (charToAnalyse >= 0x200000 && charToAnalyse <= 0x3FFFFFF)
    { utfMode = 5; }
    else if (charToAnalyse >= 0x4000000 && charToAnalyse <= 0x7FFFFFFF)
    { utfMode = 6; }

    ...

    if (utfMode > utf8EncoderMode)
    {
        utf8EncoderMode = utfMode;
    }

In this function, utfMode stays 0 for the character 'ñ', because 'ñ' == 0xFFFFFFF1 and cannot be classified by the code above.

MY QUESTION HERE IS: 1) Is it true that 'ñ' has the value 0xFFFFFFF1? If yes, how can it be classified for UTF-8 encoding? Is it possible for a character to have a value bigger than U+7FFFFFFF (0x7FFFFFFF)? 2) Is this somehow related to the terms "low surrogate" and "high surrogate"?

Thanks a lot, even if it's an absurd question :)

Upvotes: 0

Views: 1001

Answers (2)

Wolfgang

Reputation: 91

I would have liked to explain this issue, but Joni was first :)

@Joni : You are perfectly right.

As I initialize the integer array as:

int charToAnalyseStr[50] = {'a', 0x7FFFFFFF, 'ñ', 'ş', 1};

the initialization of, e.g., the third member 'ñ' occurs as follows:

  1. The member given as 'ñ' is understood by the system as a signed char (1 byte).

  2. 'ñ' has the value -15 as a signed char; this equals 241 as an unsigned char!

  3. So the value -15 is stored as the array element during initialization.

  4. The value -15, widened to a signed integer, is represented in two's complement as 0xFFFFFFF1 (hex).

The solution I found is:

int charToAnalyseStr[50] = {(unsigned char)'a', 0x7FFFFFFF, (unsigned char)'ñ', 1};

So charToAnalyseStr[2] appears in the memory window as 0x000000F1 :)

Thanks for the brainstorming!

Upvotes: 0

hexwab

Reputation: 1841

It sounds very much as though you're reading signed bytes (is your input in ISO 8859-1, perchance?): your bytes are being interpreted as being in the range -128..127 rather than 0..255, so a value that should be 0xF1 (241) is being read as -15 instead, which is 0xFFFFFFF1 in two's complement. In C, "char" is often signed by default[1]; you should be using "unsigned char".

Unicode does not go as far up as 0xfffffff1, which is why UTF-8 does not provide an encoding for such code points.

[1] To be precise, "char" is distinct from both "signed char" and "unsigned char". But it can behave as either unsigned or signed, and which you get is implementation-defined.

Upvotes: 1
