Sasha

Reputation:

Conversion of a Unicode character from a byte

In our API, we use byte[] to send data across the network. Everything worked fine until the day our "foreign" clients decided to send and receive Unicode characters.

As far as I know, Unicode characters occupy 2 bytes; however, we only allocate 1 byte in the byte array for each of them.

Here is how we read the character from the byte[] array:

        // buffer is a byte[6553] and m_index is the current location in the buffer
        char c = System.BitConverter.ToChar(buffer, m_index);
        m_index += SIZEOF_BYTE;

        return c;

The current issue is that the API receives a strange Unicode character. When I look at its hexadecimal value, the least significant byte is correct, but the most significant byte has a value when it is supposed to be 0. A quick workaround, thus far, has been to apply 0x00FF & c to mask off the most significant byte.

What is the correct approach for dealing with Unicode characters coming from the socket?

Thanks.

Solution:

Kudos to Jon:

char c = (char) buffer[m_index];

As he mentioned, the reason it works is that the client API receives a character occupying only one byte, whereas BitConverter.ToChar reads two, hence the conversion problem. I am still puzzled as to why it worked for some characters and not for others, as it should have failed in all cases.

Thanks Guys, great responses!

Upvotes: 5

Views: 5808

Answers (7)

CodingBarfield

Reputation: 3398

My only solution is to fix the API. Either tell the users to put only ASCII strings in the byte[], or fix it to support ASCII and any other encoding you need to use.

Deciding what encoding is supplied by the foreign clients from just the byte[] can be a bit tricky.

Upvotes: 0

Yuliy

Reputation: 17728

What encoding are your customers using? If some of your customers are still using ASCII, then you'll need your international customers to use something that maps the ASCII set (0-127) to itself, such as UTF-8. After that, use the UTF-8 encoding's GetString method.
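A minimal sketch of that suggestion, assuming the clients agree to send UTF-8 and that the message occupies a known range of the buffer. The names `Utf8Reader` and `ReadString`, and the `offset`/`count` parameters, are illustrative and not part of the original API:

```csharp
using System;
using System.Text;

public static class Utf8Reader
{
    // Decodes `count` bytes starting at `offset` as UTF-8.
    // Assumption: the sender encoded the text as UTF-8.
    public static string ReadString(byte[] buffer, int offset, int count)
    {
        return Encoding.UTF8.GetString(buffer, offset, count);
    }
}
```

Pure ASCII bytes decode to the same characters under UTF-8, so existing ASCII clients would be unaffected: the bytes 0x48 0x69 decode to "Hi", while the two-byte sequence 0xC3 0xA9 decodes to the single character "é".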

Upvotes: 0

Jon Skeet

Reputation: 1503559

You should use Encoding.GetString, using the most appropriate encoding.

I don't quite understand your situation fully, but the Encoding class is almost certain to be the way to handle it.

Who is in control of the data here? Your code, or that of your customers? Have you defined what the correct format is?

EDIT: Okay, I've had another look at your code: BitConverter.ToChar returns "A character formed by two bytes beginning at startIndex." If you only want to use one byte, just cast it:

char c = (char) buffer[m_index];

I'm surprised your code has been working at all, as it would be breaking any time the next byte was non-zero.
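To illustrate why the original code only appeared to work for some inputs, here is a small sketch comparing the two reads. The class and method names are made up for the example, and the results described below assume a little-endian machine (`BitConverter.IsLittleEndian` is true):

```csharp
using System;

public static class CharReadDemo
{
    // Reads two bytes starting at index, as the original code did.
    public static char ReadWithBitConverter(byte[] buffer, int index)
    {
        return BitConverter.ToChar(buffer, index);
    }

    // Reads exactly one byte and widens it to a char (U+0000 to U+00FF).
    public static char ReadWithCast(byte[] buffer, int index)
    {
        return (char)buffer[index];
    }
}
```

With `buffer = { 0x41, 0x00, 0x42, 0x43 }`, `ReadWithBitConverter(buffer, 0)` yields 'A' because the following byte happens to be zero, but `ReadWithBitConverter(buffer, 2)` yields U+4342 because the following byte (0x43) lands in the high half of the char. `ReadWithCast(buffer, 2)` yields 'B'. This is also why masking with 0x00FF recovered the expected value.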

Upvotes: 6

JaredPar

Reputation: 755457

It's unclear what exactly your goal is here. From what I can tell, there are 2 routes that you can take:

  1. Ignore all data sent in Unicode
  2. Process both unicode and ASCII strings

IMHO, #1 is the way to go. However, it sounds like your protocol is not necessarily set up to deal with a Unicode string. You will need some detection logic to determine whether the incoming string is Unicode. If it is, you can use the Encoding.Unicode.GetString method to convert that particular byte array.

Upvotes: 0

Michael Meadows

Reputation: 28426

Text streams should contain a byte-order mark (BOM) that will allow you to determine how to treat the binary data.
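As a rough sketch of that idea, the helper below checks the first bytes for the UTF-8 and UTF-16 byte-order marks and otherwise returns a caller-supplied default. `BomSniffer`, `Detect`, and the `fallback` parameter are hypothetical names, and the sketch assumes the sender actually writes a BOM, which many senders do not:

```csharp
using System;
using System.Text;

public static class BomSniffer
{
    // Returns the encoding indicated by a leading BOM,
    // or `fallback` when no known BOM is present.
    public static Encoding Detect(byte[] data, Encoding fallback)
    {
        if (data.Length >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF)
            return Encoding.UTF8;
        if (data.Length >= 2 && data[0] == 0xFF && data[1] == 0xFE)
            return Encoding.Unicode;          // UTF-16 little-endian
        if (data.Length >= 2 && data[0] == 0xFE && data[1] == 0xFF)
            return Encoding.BigEndianUnicode; // UTF-16 big-endian
        return fallback;
    }
}
```

This sketch does not distinguish the UTF-32 little-endian BOM (FF FE 00 00) from UTF-16 little-endian, and the BOM bytes themselves still need to be skipped before decoding.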

Upvotes: 0

Drew Noakes

Reputation: 311315

Unicode characters can take up to four bytes, but messages are rarely encoded on the wire using 4 bytes for each character. Rather, schemes like UTF-8 or UTF-16 are used that only bring in extra bytes when required.

Have a look at the Encoding class guidance.
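The variable-length behaviour can be seen by asking the Encoding class how many bytes a string needs. This is only an illustrative sketch; `EncodedSize` and its methods are names invented for the example:

```csharp
using System;
using System.Text;

public static class EncodedSize
{
    // Number of bytes needed to encode `text` as UTF-8.
    public static int Utf8Bytes(string text)
    {
        return Encoding.UTF8.GetByteCount(text);
    }

    // Number of bytes needed to encode `text` as UTF-16 (little-endian, no BOM).
    public static int Utf16Bytes(string text)
    {
        return Encoding.Unicode.GetByteCount(text);
    }
}
```

For "A" this gives 1 byte in UTF-8 and 2 in UTF-16; for "\u00E9" (é) 2 and 2; for "\u20AC" (€) 3 and 2; and for "\U0001F600", a character outside the Basic Multilingual Plane, 4 and 4.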

Upvotes: 0

user67143

Reputation: 327

You should look at the System.Text.Encoding.ASCII.GetString function, which takes a byte[] array and converts it to a string (for ASCII).

For Unicode strings in the UTF-8 or UTF-16 encodings, use System.Text.Encoding.UTF8 or System.Text.Encoding.Unicode (the UTF8Encoding and UnicodeEncoding classes).

There are also functions for converting strings to byte[] in the ASCIIEncoding, UTF8Encoding and UnicodeEncoding classes: see the GetBytes(String) functions.
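A short sketch of the GetBytes/GetString pairing, using a made-up helper (`RoundTrip.Through`) that encodes a string and decodes it again with the same encoding:

```csharp
using System;
using System.Text;

public static class RoundTrip
{
    // Encodes `text` to bytes and decodes it back with the same encoding.
    public static string Through(Encoding encoding, string text)
    {
        byte[] bytes = encoding.GetBytes(text);
        return encoding.GetString(bytes);
    }
}
```

`RoundTrip.Through(Encoding.UTF8, "caf\u00E9")` returns the original string, whereas `RoundTrip.Through(Encoding.ASCII, "caf\u00E9")` returns "caf?", because ASCII cannot represent é and the default ASCII encoder substitutes a question mark. Both sides of the connection therefore need to agree on an encoding that covers the characters being sent.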

Upvotes: 0
