Fire Hand

Reputation: 26416

Converting non-Unicode to Unicode

I'm trying to convert a non-Unicode string such as '¹ûº¤¡¾­¢º¤ìñ©2' to Unicode so that it reads 'ໃຊ້ໃນຄົວເຮືອນ', which is Lao. I tried the code below, but it returns '??????'. Any idea how I can convert the string?

Public Shared Function ConvertAsciiToUnicode(asciiString As String) As String
    ' Create two different encodings.
    Dim encAscii As Encoding = Encoding.ASCII
    Dim encUnicode As Encoding = Encoding.Unicode

    ' Convert the string into a byte[].
    Dim asciiBytes As Byte() = encAscii.GetBytes(asciiString)

    ' Perform the conversion from one encoding to the other.
    Dim unicodeBytes As Byte() = Encoding.Convert(encAscii, encUnicode, asciiBytes)

    ' Convert the new byte[] into a char[] and then into a string.
    ' This is a slightly different approach to converting to illustrate
    ' the use of GetCharCount/GetChars.
    Dim unicodeChars As Char() = New Char(encUnicode.GetCharCount(unicodeBytes, 0, unicodeBytes.Length) - 1) {}
    encUnicode.GetChars(unicodeBytes, 0, unicodeBytes.Length, unicodeChars, 0)
    Dim unicodeString As New String(unicodeChars)

    ' Return the new unicode string
    Return unicodeString
End Function

Upvotes: 0

Views: 11849

Answers (1)

Jirka Hanika

Reputation: 13547

Your 8-bit encoded Lao text is not ASCII. It is in some codepage such as IBM CP1133 or Microsoft LC0454 or, most likely, the Thai codepage 874. You have to find out which one it is.
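One way to narrow this down is to decode the same raw bytes with each candidate codepage and see which result reads as sensible text. The following is only a sketch: the sample bytes and the candidate list are hypothetical placeholders, and the module name is made up for illustration.

```vb
Imports System
Imports System.Text

Module CodepageProbe
    Sub Main()
        ' On .NET Core / .NET 5+ legacy codepages are only available after:
        ' Encoding.RegisterProvider(CodePagesEncodingProvider.Instance)

        ' Print as UTF-8 so non-Latin characters are not replaced by the console encoding.
        Console.OutputEncoding = Encoding.UTF8

        ' Hypothetical sample bytes; substitute the raw bytes of your own input.
        Dim raw As Byte() = {&HA1, &HD2, &HC3}

        ' Candidate codepage numbers (an assumption; extend as needed).
        Dim candidates As Integer() = {874, 1252, 1133}

        For Each cp As Integer In candidates
            Try
                Dim enc As Encoding = Encoding.GetEncoding(cp)
                Console.WriteLine("{0}: {1}", cp, enc.GetString(raw))
            Catch ex As NotSupportedException
                Console.WriteLine("{0}: not available on this system", cp)
            Catch ex As ArgumentException
                Console.WriteLine("{0}: not a valid codepage number", cp)
            End Try
        Next
    End Sub
End Module
```

Whether the console can actually display the decoded characters also depends on the console font, so writing the results to a UTF-8 file may be easier to inspect.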

It matters how you obtained (read, received, computed) the input string. By the time you make it a string, it is already in Unicode and is easy to output in UTF-8, for example like this:

Dim writer As New StreamWriter("myfile.txt", True, System.Text.Encoding.UTF8)
writer.Write(mystring)
writer.Close()

Here is the whole in-memory conversion:

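' Raw input bytes in codepage 874 (these are not UTF-8 yet).
Dim cp874_input As Byte()
...
' Re-encode the bytes from codepage 874 to UTF-8.
Dim converted As Byte() = Encoding.Convert(Encoding.GetEncoding(874), Encoding.UTF8, cp874_input)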
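For a fuller picture, here is a minimal end-to-end sketch of the same conversion, assuming the raw bytes come from a file and that codepage 874 really is the right one. The file names and the module name are hypothetical.

```vb
Imports System.IO
Imports System.Text

Module ConvertLegacyFile
    Sub Main()
        ' On .NET Core / .NET 5+ legacy codepages are only available after:
        ' Encoding.RegisterProvider(CodePagesEncodingProvider.Instance)

        ' Read the input as raw bytes so that no decoding happens implicitly.
        Dim legacyBytes As Byte() = File.ReadAllBytes("input-874.txt")

        ' Assumption: the bytes are in codepage 874.
        Dim source As Encoding = Encoding.GetEncoding(874)

        ' Byte-to-byte conversion from the legacy codepage to UTF-8.
        Dim utf8Bytes As Byte() = Encoding.Convert(source, Encoding.UTF8, legacyBytes)
        File.WriteAllBytes("output-utf8.txt", utf8Bytes)

        ' Alternatively, decode straight into a .NET String, which is Unicode internally.
        Dim text As String = source.GetString(legacyBytes)
        Console.WriteLine(text.Length)
    End Sub
End Module
```

The key point of the sketch is that the input is handled as bytes until the correct source encoding is applied. Decoding with the wrong encoding first (ASCII, for instance) discards the information needed to recover the text.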

The number 874 identifies the codepage your input is in. Whether a particular operating system installation supports this codepage is another question, but your own system almost certainly supports it if you used it to compose your Stack Overflow question.
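To check support on a given machine, one option is to look for the codepage in the list the runtime reports. This is a sketch assuming .NET Framework; on other runtimes the reported list may be shorter unless a code pages provider has been registered. The module name is hypothetical.

```vb
Imports System
Imports System.Text

Module ListCodepages
    Sub Main()
        Dim wanted As Integer = 874
        Dim found As Boolean = False

        ' Encoding.GetEncodings() returns one EncodingInfo per encoding the runtime reports.
        For Each info As EncodingInfo In Encoding.GetEncodings()
            If info.CodePage = wanted Then
                Console.WriteLine("Found {0}: {1}", info.CodePage, info.Name)
                found = True
            End If
        Next

        If Not found Then
            Console.WriteLine("Codepage {0} is not reported by this runtime.", wanted)
        End If
    End Sub
End Module
```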

Upvotes: 4
