What is the alternative to the MultiByteToWideChar and WideCharToMultiByte functions in .NET?

Question

I am trying to migrate some code from VC++ to .NET. The VC++ code uses the MultiByteToWideChar and WideCharToMultiByte functions provided by the WinAPI. I tried using the System.Text.Encoding class in .NET, but it does not work for all encodings. Is there another way to do this conversion? What is wrong with the code snippet below?

Here is my C# code:

public static string MultiByteToWideChar(string input, int codepage)
{
    Encoding e1 = Encoding.GetEncoding(codepage);
    Encoding e2 = Encoding.Unicode;

    //byte[] source = e1.GetBytes(input);

    byte[] source = MBCSToByte(input);

    byte[] target = Encoding.Convert(e1, e2, source);

    return e2.GetString(target);
}

public static string WideCharToMultiByte(string input, int codepage)
{
    Encoding e1 = Encoding.Unicode;
    Encoding e2 = Encoding.GetEncoding(codepage);

    byte[] source = e1.GetBytes(input);

    byte[] target = Encoding.Convert(e1, e2, source);

    return Encoding.GetEncoding(codepage).GetString(target);
}

private static byte[] MBCSToByte(string s)
{
    byte[] b = new byte[s.Length];
    int i = 0;
    foreach (char c in s)
        b[i++] = (byte)c;
    return b;
}

MultiByteToWideChar works only for codepage 1255, not for 866.

WideCharToMultiByte does not work for codepage 1251.

Remy Lebeau · Accepted Answer

MultiByteToWideChar() converts encoded bytes (NOT characters!) to Unicode characters.

WideCharToMultiByte() converts Unicode characters to encoded bytes (NOT characters!).

In .NET, the string type is always a sequence of Unicode characters (stored as UTF-16 code units). So using string to hold encoded bytes is just plain wrong.

In your MultiByteToWideChar() function, you are assuming that the input string contains Unicode characters that are 16-bit representations of codepage-encoded 8-bit bytes. You are copying the Unicode characters as-is to a byte[] array (truncating each one to 8 bits), then converting that presumably codepage-encoded array to a UTF-16 byte[] array, and then converting that to a UTF-16 string. This works if and only if the initial assumption is true, which is usually not the case unless your input was already corrupted.
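To make the truncation concrete, here is a small sketch that runs the question's MBCSToByte() logic over a genuine Unicode string. The class name TruncationDemo and the sample text are only illustrative, and the cast is assumed to run in the default unchecked context:

```csharp
using System;

public static class TruncationDemo
{
    // Same logic as MBCSToByte in the question: keeps only the low
    // 8 bits of each UTF-16 code unit.
    public static byte[] MBCSToByte(string s)
    {
        byte[] b = new byte[s.Length];
        int i = 0;
        foreach (char c in s)
            b[i++] = (byte)c; // unchecked narrowing: U+041F becomes 0x1F
        return b;
    }

    public static void Main()
    {
        // A real Unicode string ("Привет"), not bytes stored one per char.
        string text = "\u041F\u0440\u0438\u0432\u0435\u0442";

        // The high byte of every character is dropped, so the result is
        // not valid data in any Cyrillic codepage.
        Console.WriteLine(BitConverter.ToString(MBCSToByte(text)));
        // prints 1F-40-38-32-35-42
    }
}
```

Whatever codepage is then used to decode those bytes, the original Cyrillic text cannot be recovered from them.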

In your WideCharToMultiByte() function, you are converting the input string to a UTF-16 byte[] array, then converting that array to a codepage-encoded byte[] array. So far so good (though you could just use Encoding.GetBytes() to go from the UTF-16 string directly to the codepage-encoded byte[] without using Encoding.Convert() at all). But then you are using the same codepage to convert the codepage-encoded byte[] array back to a UTF-16 string, thus undoing everything you had done. The output string will have the same value as the input string (provided the specified codepage supports all of the Unicode characters in the input string; otherwise you will lose data during the first codepage conversion).
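The round trip can be seen in a short sketch. The names RoundTripDemo and EncodeThenDecode are made up for illustration, and the provider registration assumes .NET Core 3.0 or later (or .NET 5+), where codepages such as 1251 are not available until CodePagesEncodingProvider is registered; on .NET Framework that line is not needed:

```csharp
using System;
using System.Text;

public static class RoundTripDemo
{
    static RoundTripDemo()
    {
        // Assumption: .NET Core / .NET 5+, where Windows codepages must
        // be registered before Encoding.GetEncoding(1251) can find them.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
    }

    // The question's WideCharToMultiByte, reduced to its two effective steps.
    public static string EncodeThenDecode(string input, int codepage)
    {
        Encoding enc = Encoding.GetEncoding(codepage);
        byte[] encoded = enc.GetBytes(input); // the real WideCharToMultiByte result
        return enc.GetString(encoded);        // decoding again undoes the conversion
    }

    public static void Main()
    {
        string text = "\u041F\u0440\u0438\u0432\u0435\u0442"; // "Привет"

        // Every character exists in codepage 1251, so the text comes back unchanged.
        Console.WriteLine(EncodeThenDecode(text, 1251) == text); // True
    }
}
```

So the function appears to "do nothing" for 1251: the encoded bytes are produced and then immediately thrown away.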

That being said, the correct code should look more like this instead:

public static string MultiByteToWideChar(byte[] input, int codepage)
{
    return Encoding.GetEncoding(codepage).GetString(input);
}

public static byte[] WideCharToMultiByte(string input, int codepage)
{
    return Encoding.GetEncoding(codepage).GetBytes(input);
}

Don't use a string to hold encoded bytes; use an actual byte[] array instead.
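As a rough usage sketch of the byte[]-based functions, the following decodes codepage 866 bytes and re-encodes the text as codepage 1251. The wrapper class CodePageConversion and the sample bytes are illustrative only, and the provider registration again assumes .NET Core 3.0 or later (or .NET 5+); on .NET Framework these codepages are available without it:

```csharp
using System;
using System.Text;

public static class CodePageConversion
{
    static CodePageConversion()
    {
        // Assumption: .NET Core / .NET 5+, where codepages 866 and 1251
        // are only available after registering this provider.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
    }

    // Encoded bytes in, Unicode string out.
    public static string MultiByteToWideChar(byte[] input, int codepage)
    {
        return Encoding.GetEncoding(codepage).GetString(input);
    }

    // Unicode string in, encoded bytes out.
    public static byte[] WideCharToMultiByte(string input, int codepage)
    {
        return Encoding.GetEncoding(codepage).GetBytes(input);
    }

    public static void Main()
    {
        // "Привет" encoded in codepage 866 (DOS Cyrillic).
        byte[] cp866 = { 0x8F, 0xE0, 0xA8, 0xA2, 0xA5, 0xE2 };

        string text = MultiByteToWideChar(cp866, 866);
        Console.WriteLine(text == "\u041F\u0440\u0438\u0432\u0435\u0442"); // True

        // The same text re-encoded in codepage 1251 (Windows Cyrillic).
        byte[] cp1251 = WideCharToMultiByte(text, 1251);
        Console.WriteLine(BitConverter.ToString(cp1251)); // CF-F0-E8-E2-E5-F2
    }
}
```

The encoded data stays in byte[] from the moment it is read until the moment it is written, and string is used only for the decoded text.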
