willvv

Reputation: 8649

C#: Convert Japanese text encoded in Shift-JIS and stored as ASCII into UTF-8

I am trying to convert an old application that has some strings stored in the database as ASCII.

For example, the string: ƒ`ƒƒƒlƒ‹ƒp[ƒgƒi[‚Ì‘I‘ð is stored in the database.

Now, if I copy that string into a text editor, save it as ASCII, open the file in a web browser and let the browser auto-detect the encoding, I get the correct string in Japanese: チャネルパートナーの選択, and the page reports the detected encoding as Japanese (Shift_JIS).

When I try to do the conversion in C# with something like this:

var asciiBytes = Encoding.ASCII.GetBytes(text);
var japaneseEncoding = Encoding.GetEncoding(932);
var convertedBytes = Encoding.Convert(japaneseEncoding, Encoding.ASCII, asciiBytes);
var japaneseString = japaneseEncoding.GetString(convertedBytes);

I get ?`???l???p?[?g?i?[???I?? in japaneseString, so I cannot show it on the web page.

Any insight would be appreciated.

Thanks

Upvotes: 6

Views: 20224

Answers (3)

VoteCoffee

Reputation: 5107

This code will dump a bunch of different options out so you can see what's close. I use this a lot for comments in old applications that don't have any encoding awareness.

You can copy-paste to run it online here: https://learn.microsoft.com/en-us/dotnet/api/system.text.encoding.getencodings?view=netframework-4.8#System_Text_Encoding_GetEncodings

using System;

public class Program
{
    public static void Main()
    {
        var badstringFromDatabase = "ƒ`ƒƒƒlƒ‹ƒp[ƒgƒi[‚Ì‘I‘ð";
        var recovered1 = System.Text.Encoding.GetEncoding(932).GetBytes(badstringFromDatabase); //Shift JIS
        var recovered2 = System.Text.Encoding.GetEncoding(20932).GetBytes(badstringFromDatabase); //EUC
        var recovered3 = System.Text.Encoding.GetEncoding(51932).GetBytes(badstringFromDatabase); //EUC
        var recovered4 = System.Text.Encoding.GetEncoding(50220).GetBytes(badstringFromDatabase); //ISO-2022-JP
        var recovered5 = System.Text.Encoding.GetEncoding(50221).GetBytes(badstringFromDatabase); //ISO-2022-JP
        var recovered6 = System.Text.Encoding.GetEncoding(50222).GetBytes(badstringFromDatabase); //ISO-2022-JP
        var recovered7 = System.Text.Encoding.GetEncoding(65001).GetBytes(badstringFromDatabase); //UTF-8
        var recovered8 = System.Text.Encoding.GetEncoding(1200).GetBytes(badstringFromDatabase); //UTF-16
        var recovered9 = System.Text.Encoding.GetEncoding(12000).GetBytes(badstringFromDatabase); //UTF-32
        var recovered10 = System.Text.Encoding.GetEncoding(12001).GetBytes(badstringFromDatabase); //UTF-32BE
        var recovered11 = System.Text.Encoding.GetEncoding(65000).GetBytes(badstringFromDatabase); //UTF-7
        Console.WriteLine("Shift JIS: " + System.Text.Encoding.GetEncoding(932).GetString(recovered1)); //Shift JIS
        Console.WriteLine("EUC (20932): " + System.Text.Encoding.GetEncoding(932).GetString(recovered2)); //EUC
        Console.WriteLine("EUC (51932): " + System.Text.Encoding.GetEncoding(932).GetString(recovered3)); //EUC
        Console.WriteLine("ISO-2022-JP (50220): " + System.Text.Encoding.GetEncoding(932).GetString(recovered4)); //ISO-2022-JP
        Console.WriteLine("ISO-2022-JP (50221): " + System.Text.Encoding.GetEncoding(932).GetString(recovered5)); //ISO-2022-JP
        Console.WriteLine("ISO-2022-JP (50222): " + System.Text.Encoding.GetEncoding(932).GetString(recovered6)); //ISO-2022-JP
        Console.WriteLine("UTF-8: " + System.Text.Encoding.GetEncoding(932).GetString(recovered7)); //UTF-8
        Console.WriteLine("UTF-16: " + System.Text.Encoding.GetEncoding(932).GetString(recovered8)); //UTF-16
        Console.WriteLine("UTF-32: " + System.Text.Encoding.GetEncoding(932).GetString(recovered9)); //UTF-32
        Console.WriteLine("UTF-32BE: " + System.Text.Encoding.GetEncoding(932).GetString(recovered10)); //UTF-32BE
        Console.WriteLine("UTF-7: " + System.Text.Encoding.GetEncoding(932).GetString(recovered11)); //UTF-7
    }
}

Upvotes: 0

Matt Mitchell

Reputation: 41843

As the other answer says, I'm pretty sure you're using the ANSI/default encoding, not ASCII.

The following examples seem to get you what you're after.

var japaneseEncoding = Encoding.GetEncoding(932);

// From file bytes
var fileBytes = File.ReadAllBytes(@"C:\temp\test.html");
var japaneseTextFromFile = japaneseEncoding.GetString(fileBytes);
japaneseTextFromFile.Dump(); // Dump() is LINQPad's output helper; use Console.WriteLine elsewhere

// From string bytes
var textString = "ƒ`ƒƒƒlƒ‹ƒp[ƒgƒi[‚Ì‘I‘ð";
var textBytes = Encoding.Default.GetBytes(textString);
var japaneseTextFromString = japaneseEncoding.GetString(textBytes);
japaneseTextFromString.Dump();

Interestingly, I think I need to read up on Encoding.Convert, as it did not produce the behaviour I expected. The GetString methods only seem to work if I pass in bytes produced with Encoding.Default; if I convert to the Japanese encoding beforehand, they do not work as expected.
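A likely explanation for the Encoding.Convert surprise: Convert transcodes text. It decodes the input bytes with the source encoding and re-encodes the resulting characters with the destination encoding; it never just relabels the bytes. The sketch below illustrates that with UTF-8 and ISO-8859-1, chosen only because both are built into every .NET runtime; the class name is made up for the example.

```csharp
using System;
using System.Text;

public static class ConvertDemo
{
    public static void Main()
    {
        Encoding utf8 = Encoding.UTF8;
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");

        // U+00E9 (e with acute accent) is two bytes in UTF-8: C3 A9.
        byte[] utf8Bytes = utf8.GetBytes("\u00E9");

        // Convert = decode with the source encoding, then encode with the destination.
        byte[] viaConvert = Encoding.Convert(utf8, latin1, utf8Bytes);
        byte[] viaRoundTrip = latin1.GetBytes(utf8.GetString(utf8Bytes));

        Console.WriteLine(BitConverter.ToString(utf8Bytes));    // C3-A9
        Console.WriteLine(BitConverter.ToString(viaConvert));   // E9
        Console.WriteLine(BitConverter.ToString(viaRoundTrip)); // E9
    }
}
```

By the same logic, converting the Encoding.Default bytes to code page 932 first would decode them back into the garbled characters and then try to encode those characters as Shift-JIS, which is not the reinterpretation that GetString performs on the raw bytes.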

Upvotes: 3

Hans Passant

Reputation: 942030

"some strings stored in the database as ASCII"

It isn't ASCII; almost none of the characters in ƒ`ƒƒƒlƒ‹ƒp[ƒgƒi[‚Ì‘I‘ð are ASCII. Encoding.ASCII.GetBytes(text) is going to produce a lot of huh? characters, which is why you got all those question marks.
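To make the question marks concrete, here is a small sketch (the class name is just for illustration) that encodes the first two characters of the garbled string with Encoding.ASCII. By default that encoder substitutes '?' (0x3F) for anything outside the 7-bit range.

```csharp
using System;
using System.Text;

public static class AsciiLossDemo
{
    public static void Main()
    {
        // U+0192 (the florin sign at the start of the garbled string) is not ASCII;
        // the backtick that follows it is.
        var text = "\u0192`";

        byte[] asciiBytes = Encoding.ASCII.GetBytes(text);

        // Prints "3F-60": the florin sign was replaced by '?' (0x3F),
        // while the backtick (0x60) survived unchanged.
        Console.WriteLine(BitConverter.ToString(asciiBytes));
    }
}
```

Once a character has been turned into 0x3F, no later conversion can tell which byte it originally stood for.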

The core issue is that the bytes in the database column were read with the wrong encoding, code page 1252. Reversing that step gets the original bytes back:

var badstringFromDatabase = "ƒ`ƒƒƒlƒ‹ƒp[ƒgƒi[‚Ì‘I‘ð";
var hopefullyRecovered = Encoding.GetEncoding(1252).GetBytes(badstringFromDatabase);
var oughtToBeJapanese = Encoding.GetEncoding(932).GetString(hopefullyRecovered);

Which produces "チャネルパートナーの選択"

This is not going to be completely reliable: code page 1252 has a few unassigned codes (0x81, 0x8D, 0x8F, 0x90 and 0x9D) that are used in 932. If one of those bytes was mangled on the way in, you'll end up with a garbled string from which you cannot recover the original byte values anymore. You'll need to focus on getting the data provider to use the correct encoding.
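Until the data provider is fixed, one way to at least notice the unrecoverable cases is to run the same 1252-to-932 reinterpretation with exception fallbacks, so that an unmappable character or an invalid byte sequence fails loudly instead of being silently replaced. This is only a sketch: StrictRecovery and TryRecover are hypothetical names, and it assumes the code-page encodings are available (built in on .NET Framework; on .NET Core and later they come from CodePagesEncodingProvider).

```csharp
using System;
using System.Text;

public static class StrictRecovery
{
    // Hypothetical helper: returns false instead of producing a silently damaged string.
    public static bool TryRecover(string garbled, out string recovered)
    {
        // Needed on .NET Core / .NET 5+ for code pages 1252 and 932;
        // on .NET Framework this line can be dropped.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        var cp1252 = Encoding.GetEncoding(
            1252, EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback);
        var shiftJis = Encoding.GetEncoding(
            932, EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback);

        try
        {
            byte[] raw = cp1252.GetBytes(garbled); // undo the wrong decode
            recovered = shiftJis.GetString(raw);   // redo it with the right code page
            return true;
        }
        catch (EncoderFallbackException)
        {
            // A character that code page 1252 cannot encode.
            recovered = string.Empty;
            return false;
        }
        catch (DecoderFallbackException)
        {
            // The recovered bytes are not valid in code page 932.
            recovered = string.Empty;
            return false;
        }
    }

    public static void Main()
    {
        // The first two characters of the garbled string map to bytes 83 60.
        bool ok = TryRecover("\u0192`", out string text);
        Console.WriteLine(ok);                         // True
        Console.WriteLine(((int)text[0]).ToString("X4")); // 30C1, the first katakana of the expected output

        // A string that was never 1252 mojibake cannot be encoded and is rejected.
        Console.WriteLine(TryRecover("\u30C1", out _)); // False
    }
}
```

A check like this only detects damage; it cannot restore bytes that were already lost before the string reached your code.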

Upvotes: 11
