user963241

Reputation: 7038

TCP receiving extended ASCII or utf-8 characters

For the inverted question mark ¿ I receive two bytes, [-62][-65]. How would I get a readable UTF-8 or ASCII character from them?

Upvotes: 0

Views: 1594

Answers (3)

paxdiablo

Reputation: 881383

That is the UTF-8 encoding of that character. The inverted question mark is Unicode code point 191 (U+00BF) which, in UTF-8, is the byte pair 0xc2:0xbf.

You're seeing them as signed bytes. For example -62 signed is 256-62 or 194 unsigned - that's hex 0xc2.

Similarly, -65 signed is 256-65 or 191 unsigned - that's hex 0xbf.
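
The signed-to-unsigned step can be sketched in Python as follows. This is only a minimal illustration, assuming the two values arrive as signed integers exactly as printed in the question; the names `to_unsigned` and `signed_values` are made up for this example.

```python
def to_unsigned(b):
    """Map a signed 8-bit value (-128..127) to its unsigned form (0..255)."""
    return b & 0xFF

signed_values = [-62, -65]  # the values shown in the question

# Rebuild the raw byte string from the signed values.
raw = bytes(to_unsigned(b) for b in signed_values)

print(raw.hex())            # c2bf
print(raw.decode("utf-8"))  # ¿
```

Masking with `0xFF` has the same effect as adding 256 to a negative value, so -62 becomes 194 and -65 becomes 191.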

If you want to convert your UTF-8 sequence into a code point, you can use the table below.

    Range              Encoding  Binary value
    -----------------  --------  --------------------------
    U+000000-U+00007f  0xxxxxxx  0xxxxxxx

    U+000080-U+0007ff  110yyyxx  00000yyy xxxxxxxx
                       10xxxxxx

    U+000800-U+00ffff  1110yyyy  yyyyyyyy xxxxxxxx
                       10yyyyxx
                       10xxxxxx

    U+010000-U+10ffff  11110zzz  000zzzzz yyyyyyyy xxxxxxxx
                       10zzyyyy
                       10yyyyxx
                       10xxxxxx

For example, your 0xc2:0xbf is binary 11000010 10111111 which matches the second case:

      11000010 10111111
         |||||   ||||||
         |||\\  //////
         ||| ||||||||
    00000000 10111111  ->  0x00bf  ->  191
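
The bit shuffling in the table can also be written as a small function. The sketch below is a hand-rolled illustration, not a replacement for a real decoder: the name `decode_utf8_sequence` is hypothetical, and it deliberately skips the checks for overlong encodings and surrogate code points that a conforming decoder performs.

```python
def decode_utf8_sequence(seq):
    """Decode one UTF-8 sequence (1 to 4 unsigned bytes) into a code point."""
    first = seq[0]
    if first < 0x80:               # 0xxxxxxx
        length, cp = 1, first
    elif first >> 5 == 0b110:      # 110yyyxx
        length, cp = 2, first & 0x1F
    elif first >> 4 == 0b1110:     # 1110yyyy
        length, cp = 3, first & 0x0F
    elif first >> 3 == 0b11110:    # 11110zzz
        length, cp = 4, first & 0x07
    else:
        raise ValueError("invalid lead byte")

    if len(seq) != length:
        raise ValueError("wrong number of bytes for this lead byte")

    # Each continuation byte (10xxxxxx) contributes its low six bits.
    for b in seq[1:]:
        if b >> 6 != 0b10:
            raise ValueError("invalid continuation byte")
        cp = (cp << 6) | (b & 0x3F)
    return cp

print(decode_utf8_sequence([0xC2, 0xBF]))  # 191
```

In practice a library decoder, such as `bytes.decode("utf-8")` in Python, does this work and also rejects malformed input.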

Upvotes: 4

unwind

Reputation: 399803

Look at the byte values in hexadecimal:

  • -62 is 0xc2
  • -65 is 0xbf

If you look up the Unicode information for the glyph in question, you can see that these are, indeed, the two bytes that make up the UTF-8 encoding of the inverted question mark glyph.

Upvotes: 1

Henk Holterman

Reputation: 273229

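Those 2 bytes are probably UTF-8.

Plain ASCII is 7-bit and has no inverted question mark; for a single-byte "extended ASCII" encoding you would need to pick a specific code page.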
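
To illustrate why the code page matters, here is a short Python sketch that encodes the same character under a few encodings. The two code pages (`latin-1` and `cp437`) are picked only as examples and are not taken from the question.

```python
ch = "\u00bf"  # inverted question mark

print(ch.encode("utf-8"))    # b'\xc2\xbf'
print(ch.encode("latin-1"))  # b'\xbf'
print(ch.encode("cp437"))    # b'\xa8'

# 7-bit ASCII has no code for this character at all.
try:
    ch.encode("ascii")
except UnicodeEncodeError:
    print("not representable in 7-bit ASCII")
```

The same character maps to a different single byte in each code page, so sender and receiver have to agree on one.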

And what exactly is a 'readable' char encoding?

Upvotes: 1
