user5528169

Trying to detect the encoding of Russian text and read it as a string

I received some Russian text over the network. Here is a dump of those bytes:

When I try to interpret them as an ASCII string, it of course doesn't work, and they don't seem to be UTF-8 either. Can someone explain how to read these bytes as a string in C#? (You can see that the debugger shows the letters next to them.)

[Image: debugger view of the received bytes]

Upvotes: 0

Views: 3471

Answers (3)

StepUp

Reputation: 38164

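using System.IO;
using System.Text;

// Encode the text as UTF-8, convert the bytes to windows-1251, and write them to a file.
var input = "Привет, люди!";
var utf8bytes = Encoding.UTF8.GetBytes(input);
var win1251Bytes = Encoding.Convert(Encoding.UTF8, Encoding.GetEncoding("windows-1251"), utf8bytes);
File.WriteAllBytes(@"foo.txt", win1251Bytes);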
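The snippet above goes from a string to windows-1251 bytes; the question asks for the opposite direction. A minimal sketch of that round trip follows, assuming .NET 5 or later (top-level statements), where legacy code pages have to be registered through CodePagesEncodingProvider before GetEncoding can find them. The `received` array is a stand-in for the bytes that arrive over the network.

```csharp
using System;
using System.Text;

// Assumption: .NET Core / .NET 5+, where code pages such as 1251 must be registered first.
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

var win1251 = Encoding.GetEncoding("windows-1251");

// Stand-in for the network payload: the sample text encoded as windows-1251.
byte[] received = win1251.GetBytes("Привет, люди!");

// Decode the bytes back into a .NET string using the same encoding.
string decoded = win1251.GetString(received);

// Prints True: every character of the sample exists in windows-1251, so the round trip is lossless.
Console.WriteLine(decoded == "Привет, люди!");
```

Decoding only works if the encoding passed to GetString matches the one the sender used; a mismatch produces readable-looking but wrong characters rather than an error.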

Upvotes: 0

Pyfhon

Reputation: 280

In general, if you know where the text comes from, you usually have some information about its encoding, so you can simply use the Encoding class: select the appropriate encoding and call GetString.

For example, Encoding.UTF8.GetString(bytes) or Encoding.GetEncoding(1251).GetString(bytes).

If you have no information about the encoding, that is a different task: you have to look for an encoding-detection algorithm.
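One cheap first step, when the only real doubt is "UTF-8 or a known legacy code page", is to decode with a strict UTF-8 decoder and fall back if the bytes are invalid. The sketch below is an illustration, not a general detector: `DecodeWithFallback` is a hypothetical helper name, and the code assumes .NET 5 or later with the code-page provider registered.

```csharp
using System;
using System.Text;

// Assumption: .NET Core / .NET 5+, where code pages such as 1251 must be registered first.
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

// Hypothetical helper: try strict UTF-8 first, then fall back to a known legacy encoding.
static string DecodeWithFallback(byte[] data, Encoding fallback)
{
    // With throwOnInvalidBytes set, invalid sequences raise an exception
    // instead of being silently replaced with U+FFFD.
    var strictUtf8 = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);
    try
    {
        return strictUtf8.GetString(data);
    }
    catch (DecoderFallbackException)
    {
        return fallback.GetString(data);
    }
}

// "Привет" encoded as windows-1251; this byte sequence is not valid UTF-8.
var sample = new byte[] { 207, 240, 232, 226, 229, 242 };
Console.WriteLine(DecodeWithFallback(sample, Encoding.GetEncoding(1251)));
```

This works because text in a single-byte Cyrillic code page is very unlikely to form valid UTF-8 by accident, but it says nothing about which legacy code page is the right fallback.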

Upvotes: 1

Kvam

Reputation: 2218

Looks like Cyrillic, code page 1251.

var bytes = new byte[]
{
    210, 240, 224, 237, 231, 224, 234, 246, 232, 255, 32, 237, 229, 32, 236, 238, 230, 229, 242, 32, 225, 251, 242
};
var text = System.Text.Encoding.GetEncoding(1251).GetString(bytes);
// text = "Транзакция не может быт"

Not sure if there's a better way to figure it out than looping over the available code pages and seeing what looks right:

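// Code page identifiers are at most 65535.
for (var i = 1; i <= 65535; ++i)
{
    try
    {
        // Look up the encoding once; GetEncoding throws for unsupported code pages.
        var encoding = System.Text.Encoding.GetEncoding(i);
        Console.WriteLine(encoding.GetString(bytes));
        Console.WriteLine("Encoding: {0}", i);
        Console.WriteLine(encoding.EncodingName);
        Console.WriteLine();
    }
    catch (Exception ex) when (ex is ArgumentException || ex is NotSupportedException)
    {
        // Skip code pages that are not available on this system.
    }
}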
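Instead of reading every candidate by eye, the loop can be narrowed to the common Cyrillic code pages and scored automatically. The sketch below is a rough heuristic, not an established detection algorithm: it assumes the text is mostly lowercase Russian and picks the candidate whose decoding contains the most lowercase Cyrillic letters. The helper names `CountLowercaseCyrillic` and `GuessCyrillicEncoding` are hypothetical, and the code assumes .NET 5 or later with the code-page provider registered.

```csharp
using System;
using System.Linq;
using System.Text;

// Assumption: .NET Core / .NET 5+, where legacy code pages must be registered first.
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

// Hypothetical scoring function: number of lowercase Cyrillic letters (а-я, ё) in the text.
static int CountLowercaseCyrillic(string text)
{
    return text.Count(c => (c >= '\u0430' && c <= '\u044F') || c == '\u0451');
}

// Hypothetical helper: decode with each candidate and keep the best-scoring one.
static Encoding GuessCyrillicEncoding(byte[] data)
{
    // Candidates: windows-1251, KOI8-R, CP866 (DOS), ISO-8859-5.
    int[] candidates = { 1251, 20866, 866, 28595 };
    return candidates
        .Select(codePage => Encoding.GetEncoding(codePage))
        .OrderByDescending(encoding => CountLowercaseCyrillic(encoding.GetString(data)))
        .First();
}

var bytes = new byte[]
{
    210, 240, 224, 237, 231, 224, 234, 246, 232, 255, 32, 237, 229, 32, 236, 238, 230, 229, 242, 32, 225, 251, 242
};

// For these bytes the heuristic selects code page 1251.
var guess = GuessCyrillicEncoding(bytes);
Console.WriteLine("{0} ({1})", guess.WebName, guess.CodePage);
Console.WriteLine(guess.GetString(bytes));
```

Counting lowercase letters rather than any Cyrillic letter matters because windows-1251 and KOI8-R both map the whole 0xC0-0xFF range to Cyrillic letters, only with the case reversed; text decoded with the wrong one of the two comes out mostly uppercase. Short or all-caps input can still fool this check.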

Upvotes: 1
