Kei

Reputation: 1026

Fixing a mis-encoded string after the fact

Main problem and question:

Given a garbled string for which the actual text is known, is it possible to consistently repair the garbled string?

According to Nyerguds's comment on this answer:

If the string is an incorrect decoding done with a simply 8-bit Encoding and you have the Encoding used to decode it, you can usually get the bytes back without any corruption, though.

(emphases mine)

This suggests that there are cases where it is not possible to get the original bytes back, which leads me to the following question: are there cases where (mis)decoding an array of bytes is a lossy and irreversible operation?
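To illustrate the lossless case the comment describes, here is a minimal sketch, assuming the wrong decoding was done with ISO-8859-1, where every byte value maps to exactly one character and back. The class and method names are hypothetical and only serve the example.

```csharp
using System;
using System.Linq;
using System.Text;

static class Latin1RoundTrip
{
    // Undo a wrong ISO-8859-1 decoding of bytes that were really UTF-8.
    public static string Repair(string garbled)
    {
        Encoding latin1 = Encoding.GetEncoding("ISO-8859-1");
        byte[] originalBytes = latin1.GetBytes(garbled); // every char maps back to one byte
        return Encoding.UTF8.GetString(originalBytes);
    }

    public static void Main()
    {
        Encoding latin1 = Encoding.GetEncoding("ISO-8859-1");

        byte[] utf8 = Encoding.UTF8.GetBytes("fa\u00E7ade"); // "façade"
        string garbled = latin1.GetString(utf8);             // wrong decoding
        byte[] recovered = latin1.GetBytes(garbled);         // back to bytes

        Console.WriteLine(utf8.SequenceEqual(recovered));     // True
        Console.WriteLine(Repair(garbled) == "fa\u00E7ade");  // True
    }
}
```

The round trip works here because ISO-8859-1 has no unassigned or invalid byte values. That property does not hold for multi-byte encodings such as Shift-JIS.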

Background: I am calling an external C++ library that calls a web API somewhere. Sometimes this library gives me slightly garbled text. In my C# project, I am trying to find a way to consistently reverse the miscoding, but I only seem to be able to do so part of the time.

What I've tried:

It seems clear that the C++ library is decoding the original bytes with the wrong encoding before it passes the result to me as a string. My approach has been to guess which encoding the C++ library used to interpret the original source bytes and use it to turn the garbled string back into bytes. I then iterate through all available encodings, reinterpreting those hopefully "original" bytes with each one.

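using System.Collections.Generic;
using System.Linq;
using System.Text;

class TestCase
{
    // The garbled text as received from the C++ library
    public string Original { get; set; }
    // The text that was actually intended
    public string Actual { get; set; }
    // Encoding pairs that turned Original back into Actual
    public List<string> Matches { get; } = new List<string>();
}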

void Main()
{
    var testCases = new List<TestCase>()
    {
        new TestCase {Original = "窶弑-shaped", Actual = "“U-shaped"},
        new TestCase {Original = "窶廡窶・Type", Actual = "“F” Type"},
        new TestCase {Original = "Ko窶冩lau", Actual = "Ko’olau"},
        new TestCase {Original = "窶彗s is", Actual = "“as is"},
        new TestCase {Original = "窶從ew", Actual = "“new"},
        new TestCase {Original = "faテァade", Actual = "façade"}
    };

    var encodings = Encoding.GetEncodings().Select(x => x.GetEncoding()).ToList();
    foreach (var testCase in testCases)
    {
        foreach (var from in encodings)
        {
            foreach (var to in encodings)
            {
                // Guess the original bytes of the string
                var guessedSourceBytes = from.GetBytes(testCase.Original);
                // Guess what the bytes should have been interpreted as
                var guessedActualString = to.GetString(guessedSourceBytes);

                if (guessedActualString == testCase.Actual)
                {
                    testCase.Matches.Add($"Reversed using \"{from.CodePage} {from.EncodingName}\", reinterpreted as: \"{to.CodePage} {to.EncodingName}\"");
                }
            }
        }
    }
}

Results

Of the six test cases, all but one (窶廡窶・Type) were reversed successfully. In the successful cases, getting the bytes with Shift-JIS (code page 932) produced the correct "original" byte sequence, which then decoded properly as UTF-8.

Getting the Shift-JIS bytes for 窶廡窶・ yields: E2 80 9C 46 E2 80 81 45. E2 80 9C coincides with the UTF-8 bytes for the left double quotation mark, which is correct. However, E2 80 81 is an em quad in UTF-8, not the right double quotation mark (E2 80 9D) I am expecting. Reinterpreting the whole byte sequence as UTF-8 results in “F EType.
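The byte-level comparison described above can be sketched as follows. This is only an illustration: `RoundTripCheck` and `SurvivesMisdecoding` are hypothetical names, and the `RegisterProvider` call is an assumption that the code runs on .NET Core or .NET 5+, where code page 932 is not available by default (on .NET Framework that line can be dropped).

```csharp
using System;
using System.Linq;
using System.Text;

static class RoundTripCheck
{
    // True when decoding `bytes` with `wrong` and encoding the result
    // with the same encoding gives back exactly the same bytes.
    public static bool SurvivesMisdecoding(byte[] bytes, Encoding wrong)
    {
        string garbled = wrong.GetString(bytes);
        return wrong.GetBytes(garbled).SequenceEqual(bytes);
    }

    public static void Main()
    {
        // Assumption: needed on .NET Core / .NET 5+ to make code page 932 available.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding sjis = Encoding.GetEncoding(932);

        foreach (string text in new[] { "\u201CU-shaped", "\u201CF\u201D Type" })
        {
            byte[] utf8 = Encoding.UTF8.GetBytes(text);
            byte[] back = sjis.GetBytes(sjis.GetString(utf8));

            // Print both byte sequences as hex so any difference is visible.
            Console.WriteLine(BitConverter.ToString(utf8));
            Console.WriteLine(BitConverter.ToString(back));
            Console.WriteLine(SurvivesMisdecoding(utf8, sjis));
        }
    }
}
```

A likely explanation for the mismatch, offered as an assumption rather than something confirmed in this thread: the UTF-8 bytes for ” followed by a space are E2 80 9D 20, and a Shift-JIS decoder pairs them as E2 80 and 9D 20. 9D is a lead byte, but 20 is not a valid trail byte (valid trail bytes are 40 to 7E and 80 to FC), so the decoder substitutes a replacement character for the pair and the original two bytes can no longer be recovered from the string.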

No matter which encoding I use to derive the "original" bytes, and no matter what encoding I use to reinterpret said bytes, no combination seems to be able to successfully convert 窶廡窶・ to “F”.

Interestingly, if I take the UTF-8 bytes for “F” Type and purposely misinterpret those bytes as Shift-JIS, I get back 窶廡窶・Type:

Encoding.GetEncoding(932).GetString(Encoding.UTF8.GetBytes("“F” Type"))

This leads me to believe that decoding with the wrong encoding can actually lead to data loss. I'm not well versed in encodings, though, so could someone confirm whether my conclusion is correct and, if so, why this data loss occurs?

Upvotes: 0

Views: 214

Answers (1)

Alexei Levenkov

Reputation: 100527

Yes, there are encodings that don't support all characters. The most common example is ASCIIEncoding, which replaces every character outside the standard ASCII range with ?.

...Because ASCII is a 7-bit encoding, ASCII characters are limited to the lowest 128 Unicode characters, from U+0000 to U+007F. … characters outside that range are replaced with a question mark (?) before the encoding operation is performed.
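As a small sketch of that replacement behavior (the class and method names are hypothetical), the following pushes a string through `Encoding.ASCII` and back, then uses an exception fallback to detect the loss instead of letting it happen silently:

```csharp
using System;
using System.Text;

static class AsciiLoss
{
    // Encode to ASCII and decode again; unsupported characters come back as '?'.
    public static string ThroughAscii(string text) =>
        Encoding.ASCII.GetString(Encoding.ASCII.GetBytes(text));

    // Report whether the text can be encoded as ASCII without replacement.
    public static bool FitsInAscii(string text)
    {
        Encoding strict = Encoding.GetEncoding(
            "us-ascii",
            EncoderFallback.ExceptionFallback,
            DecoderFallback.ExceptionFallback);
        try
        {
            strict.GetBytes(text);
            return true;
        }
        catch (EncoderFallbackException)
        {
            return false;
        }
    }

    public static void Main()
    {
        Console.WriteLine(ThroughAscii("fa\u00E7ade")); // fa?ade
        Console.WriteLine(FitsInAscii("fa\u00E7ade"));  // False
        Console.WriteLine(FitsInAscii("facade"));       // True
    }
}
```

Once the ç has been replaced with ?, nothing in the resulting bytes says which character was there originally, which is why the operation cannot be reversed.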

Upvotes: 1
