Reputation: 91
I'm working on a translation of UTF-8 encoding code from C# into C. UTF-8 (in its original, pre-2003 form) covers the range of character values from 0x0000 to 0x7FFFFFFF (http://en.wikipedia.org/wiki/UTF-8).
The encoding function in the C# version encodes, for example, the character 'ñ' without a problem.
In my C sample program, the character 'ñ' has the hex value 0xFFFFFFF1 when I look at it in the memory window of VS 2005. But in the Windows character table, 'ñ' has the hex value 0xF1.
Now, in my sample program, I scan the characters of the string and find the highest UTF-8 range, to determine which UTF-8 encoding mode should be used. Here "charToAnalyse" is one character of the string:
{
    char utfMode = 0;
    char utf8EncoderMode = 0;

    /* Classify: how many UTF-8 bytes does this character need? */
    if (charToAnalyse >= 0x0000 && charToAnalyse <= 0x007F)
        { utfMode = 1; }      /* 1 byte (ASCII) */
    else if (charToAnalyse >= 0x0080 && charToAnalyse <= 0x07FF)
        { utfMode = 2; }      /* 2 bytes */
    else if (charToAnalyse >= 0x0800 && charToAnalyse <= 0xFFFF)
        { utfMode = 3; }      /* 3 bytes */
    else if (charToAnalyse >= 0x10000 && charToAnalyse <= 0x1FFFFF)
        { utfMode = 4; }      /* 4 bytes */
    else if (charToAnalyse >= 0x200000 && charToAnalyse <= 0x3FFFFFF)
        { utfMode = 5; }      /* 5 bytes (legacy UTF-8 only) */
    else if (charToAnalyse >= 0x4000000 && charToAnalyse <= 0x7FFFFFFF)
        { utfMode = 6; }      /* 6 bytes (legacy UTF-8 only) */

    ...

    /* Remember the widest mode needed by any character in the string. */
    if (utfMode > utf8EncoderMode)
    {
        utf8EncoderMode = utfMode;
    }
In this function, utfMode stays 0 for the character 'ñ', because ñ == 0xFFFFFFF1 and therefore cannot be classified by any of the ranges above.
MY QUESTION HERE IS: 1) Is it true that 'ñ' has the value 0xFFFFFFF1? If yes, how can it be classified for UTF-8 encoding? Is it possible for a character to have a value bigger than U+7FFFFFFF (0x7FFFFFFF)? 2) Is this somehow related to the terms "low surrogate" and "high surrogate"?
Thanks a lot, even if it's an absurd question :)
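For context, each mode above is meant to select how many bytes the (legacy, 6-byte) UTF-8 scheme emits. Here is a minimal sketch of that emission step (a simplified stand-in, not my actual translation):

    #include <stdio.h>

    /* Lead-byte patterns for 1..6-byte sequences (legacy RFC 2279 UTF-8). */
    static const unsigned char leadByte[7] = { 0, 0x00, 0xC0, 0xE0, 0xF0, 0xF8, 0xFC };

    /* Writes the UTF-8 bytes of code point 'cp' into 'out'; returns the byte count. */
    int encodeUtf8(unsigned int cp, unsigned char *out)
    {
        int mode, i;
        if      (cp <= 0x7F)       mode = 1;
        else if (cp <= 0x7FF)      mode = 2;
        else if (cp <= 0xFFFF)     mode = 3;
        else if (cp <= 0x1FFFFF)   mode = 4;
        else if (cp <= 0x3FFFFFF)  mode = 5;
        else if (cp <= 0x7FFFFFFF) mode = 6;
        else return 0;             /* beyond even legacy UTF-8 */

        for (i = mode - 1; i > 0; i--) {         /* continuation bytes: 6 bits each */
            out[i] = (unsigned char)(0x80 | (cp & 0x3F));
            cp >>= 6;
        }
        out[0] = (unsigned char)(leadByte[mode] | cp);
        return mode;
    }

    int main(void)
    {
        unsigned char buf[6];
        int i, n = encodeUtf8(0xF1, buf);        /* 'ñ' as U+00F1 */
        for (i = 0; i < n; i++)
            printf("%02X ", buf[i]);             /* prints: C3 B1 */
        printf("\n");
        return 0;
    }

Encoded correctly, 'ñ' (U+00F1) comes out as the two bytes C3 B1.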
Upvotes: 0
Views: 1001
Reputation: 91
I would have liked to explain this issue myself, but Joni was first :)
@Joni: You are perfectly right.
When I initialize the integer array as:
int charToAnalyseStr[50]= {'a', 0x7FFFFFFF, 'ñ', 'ş', 1};
the initialization of, e.g., the third member 'ñ' happens as follows:
The literal 'ñ' is understood by the system as a signed char (1 byte).
'ñ' has the value -15 as a signed char; the same bit pattern is 241 as an unsigned char!
So it is the value -15 that actually gets stored as the array element.
The value -15, sign-extended to a signed 32-bit integer, is 0xFFFFFFF1 in hex.
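This sign extension can be reproduced in isolation (a minimal demo I'm adding here, assuming a platform such as VS 2005 where plain char is signed):

    #include <stdio.h>

    int main(void)
    {
        char c = (char)0xF1;    /* 'ñ' in Latin-1; stored as -15 when char is signed */
        int  i = c;             /* sign extension: -15 becomes 0xFFFFFFF1 */
        printf("0x%08X\n", (unsigned int)i);    /* prints: 0xFFFFFFF1 */
        return 0;
    }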
The solution I found is this:
int charToAnalyseStr[50]= {(unsigned char)'a', 0x7FFFFFFF, (unsigned char)'ñ', 1};
So charToAnalyseStr[2] now appears in the memory window as 0x000000F1 :)
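A quick way to check this (same assumptions as above; I write 'ñ' as '\xF1' so the demo doesn't depend on the source file's encoding):

    #include <stdio.h>

    int main(void)
    {
        int charToAnalyseStr[50] = { (unsigned char)'a', 0x7FFFFFFF,
                                     (unsigned char)'\xF1' /* 'ñ' */, 1 };
        printf("0x%08X\n", (unsigned int)charToAnalyseStr[2]);   /* prints: 0x000000F1 */
        return 0;
    }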
Thanks for the brainstorming!
Upvotes: 0
Reputation: 1841
It sounds very much as though you're reading signed bytes (is your input in ISO 8859-1, perchance?): your bytes are being interpreted as being in the range -128..127 rather than 0..255, and the value that should be 0xF1 (241) is being read as -15 instead, which is 0xFFFFFFF1 in two's complement. In C, "char" is often signed by default[1]; you should be using "unsigned char".
Unicode does not go as far up as 0xFFFFFFF1, which is why UTF-8 does not provide an encoding for such code points.
[1] To be precise, "char" is distinct from both "signed char" and "unsigned char". But it can behave as either unsigned or signed, and which you get is implementation-defined.
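To illustrate the fix (a small sketch, assuming ISO 8859-1 input as guessed above): convert each byte to unsigned char before doing any range tests:

    #include <stdio.h>

    int main(void)
    {
        const char *latin1 = "a\xF1z";            /* "añz" in ISO 8859-1 */
        const char *p;
        for (p = latin1; *p != '\0'; p++) {
            unsigned char b = (unsigned char)*p;  /* always 0..255, never negative */
            printf("0x%02X\n", b);                /* prints 0x61, 0xF1, 0x7A */
        }
        return 0;
    }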
Upvotes: 1