filter invalid values in json string

Question

I'm receiving a string in an HTML body that I'm trying to process into valid JSON. The string isn't valid JSON as received and looks like this:

äÄ
    "key1": "  10",
    "key2": "beigef}gtem Zahlschein",
    "key3": "     G E L \ S C H T",
    "key4": "M}nchen",
    "key5": "M{rz",
    "key6": "[huus"
Ü
ä

I've written a function that replaces all the faulty characters to produce a valid JSON string, but how do I do the reverse without destroying the characters that JSON itself needs?

This is how I replaced the characters:

private static string FixChars(string input)
    {
        if (!string.IsNullOrEmpty(input))
        {
            if (input.Contains("["))
            {
                input = input.Replace("[", "Ä");
            }
            if (input.Contains(@"\"))
            {
                input = input.Replace(@"\", "Ö");
            }
            if (input.Contains("]"))
            {
                input = input.Replace("]", "Ü");
            }
            if (input.Contains("{"))
            {
                input = input.Replace("{", "ä");
            }
            if (input.Contains("|"))
            {
                input = input.Replace("|", "ö");
            }
            if (input.Contains("}"))
            {
                input = input.Replace("}", "ü");
            }
            if (input.Contains("~"))
            {
                input = input.Replace("~", "ß");
            }
            //DS_Stern caused problems when creating the XML
            //if (input.Contains("*"))
            //{
            //    input = input.Replace("*", "Stern");
            //}
        }
        return input;
    }

Then I tried to deserialize the JSON array into an array of dictionaries like this:

deserializedRequest = JsonConvert.DeserializeObject<Dictionary<string, string>[]>(json);

How do I access the different dictionaries, use my FixChars-method on the values and reserialize a valid json-string from that?

EDIT: Encoding via IBM273 and decoding via IBM037 works fine to create a valid JSON string, but the result still contains a minor error: the character 'ö' comes out as '|' with that pair.

dbc · Accepted Answer

It looks as though the HTML page containing your JSON was encoded into a byte stream on your Unisys A-Series type of machine (cobol74) using one encoding and then decoded by your code using a different encoding, thereby causing some characters to get remapped or lost. To fix your problem, you need to determine the original encoding used on that Unisys computer and decode the HTML stream using it. What makes this a little more complicated is that we don't know which encoding .NET chose to decode the HTML either.

One way to make the determination is to take a sample of the expected JSON, then encode it and decode it using all possible pairs of encodings available in .NET. If any pair of encodings reproduces the incorrect result you are seeing, then the encoding used to encode the string may be the one used on the Unisys computer. By reversing that transformation you may be able to fix your string, assuming no characters were dropped.

The following code does this test:

var correctString = "{}[]";
var observedString = "äüÄÜ";

int count = 0;
foreach (var toEncoding in Encoding.GetEncodings())
    foreach (var fromEncoding in Encoding.GetEncodings())
    {
        var s = toEncoding.GetEncoding().GetString(fromEncoding.GetEncoding().GetBytes(correctString));
        if (s == observedString)
        {
            Console.WriteLine(string.Format("Match Found: Encoding via {0} and decoding via {1}", fromEncoding.Name, toEncoding.Name));
            count++;
        }
    }
Console.WriteLine("Found {0} matches", count);

This produces 147 matches, including a number of pairs of EBCDIC encodings. For the full list see this fiddle.
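One caveat when reproducing this search: on .NET Core and .NET 5+, the EBCDIC code pages are not available until `CodePagesEncodingProvider` is registered, and depending on the runtime version `Encoding.GetEncodings()` may still not enumerate provider-supplied encodings. The following is a minimal sketch, not part of the original test, that checks whether the encodings discussed here can be resolved. The class name `EncodingAvailability` and the helper `IsAvailable` are hypothetical names introduced for illustration.

```csharp
using System;
using System.Text;

public static class EncodingAvailability
{
    // Hypothetical helper: returns true if the named encoding can be resolved on this runtime.
    public static bool IsAvailable(string name)
    {
        try
        {
            Encoding.GetEncoding(name);
            return true;
        }
        catch (Exception e) when (e is ArgumentException || e is NotSupportedException)
        {
            return false;
        }
    }

    public static void Main()
    {
        // Needed on .NET Core / .NET 5+; on the classic .NET Framework the
        // code pages are built in and this line can be dropped.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        Console.WriteLine("Enumerated encodings: {0}", Encoding.GetEncodings().Length);
        foreach (var name in new[] { "IBM273", "IBM01141", "IBM870", "IBM037" })
        {
            Console.WriteLine("{0}: {1}", name, IsAvailable(name) ? "available" : "missing");
        }
    }
}
```

If the enumerated count is small while the individual names resolve, the nested loops above would have to iterate over an explicit list of candidate names instead of `Encoding.GetEncodings()`.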

Next, let's try to cut down on the matches by testing the full JSON string:

var correctJson = @"{[
    ""key1"": ""  10"",
    ""key2"": ""beigefügtem Zahlschein"",
    ""key3"": ""     G E L Ö S C H T"",
    ""key4"": ""München"",
    ""key5"": ""März"",
    ""key6"": ""Ähuus"",
    ""key7"": ""ö"",
    ""key8"": ""ß"",
]
{";
var observedJson = @"äÄ
    ""key1"": ""  10"",
    ""key2"": ""beigef}gtem Zahlschein"",
    ""key3"": ""     G E L \ S C H T"",
    ""key4"": ""M}nchen"",
    ""key5"": ""M{rz"",
    ""key6"": ""[huus"",
    ""key7"": ""|"",
    ""key8"": ""~"",
Ü
ä";

int count = 0;
foreach (var toEncoding in Encoding.GetEncodings())
    foreach (var fromEncoding in Encoding.GetEncodings())
    {
        var s = toEncoding.GetEncoding().GetString(fromEncoding.GetEncoding().GetBytes(correctJson));
        if (s == observedJson)
        {
            Console.WriteLine(string.Format("Match Found: Encoding via {0} and decoding via {1}", fromEncoding.Name, toEncoding.Name));
            count++;
        }
    }
Console.WriteLine("Found {0} matches", count);

This produces just 2 EBCDIC matches:

Match Found: Encoding via IBM01141 and decoding via IBM870
Match Found: Encoding via IBM273 and decoding via IBM870

So one of these is almost certainly the correct pair of encodings. But which one? According to Wikipedia:

CCSID 1141 is the Euro currency update of code page/CCSID 273. In that code page, the "¤" (currency) character at code point 9F is replaced with the "€" (Euro) character.

So to narrow down the encoding to a single choice, you'll need to test a sample with the "€" character.
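As a rough sketch of that check, the snippet below prints the byte each candidate encoding produces for "€". Comparing those bytes with what the Unisys machine actually sends for a record known to contain "€" should distinguish the two. The class name `EuroProbe` and the helper `EncodeEuro` are hypothetical, and the sketch assumes both encoding names resolve on your runtime.

```csharp
using System;
using System.Text;

public static class EuroProbe
{
    // Hypothetical helper: encodes the Euro sign (U+20AC) with the named encoding.
    public static byte[] EncodeEuro(string encodingName)
    {
        return Encoding.GetEncoding(encodingName).GetBytes("\u20AC");
    }

    public static void Main()
    {
        // Needed on .NET Core / .NET 5+; can be dropped on the classic .NET Framework.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        foreach (var name in new[] { "IBM01141", "IBM273" })
        {
            // IBM273 has no Euro sign, so it is expected to fall back to a replacement byte.
            Console.WriteLine("{0}: {1}", name, BitConverter.ToString(EncodeEuro(name)));
        }
    }
}
```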

Then if I add the following extension method:

public static class TextExtensions
{
    public static string Reencode(this string s, Encoding toEncoding, Encoding fromEncoding)
    {
        return toEncoding.GetString(fromEncoding.GetBytes(s));
    }
}

I can fix your JSON by doing:

var fixedJson = observedJson.Reencode(Encoding.GetEncoding("IBM01141"), Encoding.GetEncoding("IBM870"));
Console.WriteLine(fixedJson);
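
If you have access to the raw bytes of the HTTP response, decoding them directly with the correct encoding avoids the lossy round trip through a wrongly decoded string. The sketch below shows both routes on single values taken from the sample. `DecodeDemo`, `Fix` and `DecodeRaw` are hypothetical names, and the defaults assume the IBM01141/IBM870 pair found above is the right one for your data.

```csharp
using System;
using System.Text;

public static class DecodeDemo
{
    // Reverses the mis-decoding: re-encode with the encoding that was wrongly
    // used to decode, then decode with the encoding the host actually used.
    public static string Fix(string observed, string actualName = "IBM01141", string wrongName = "IBM870")
    {
        Encoding wrong = Encoding.GetEncoding(wrongName);
        Encoding actual = Encoding.GetEncoding(actualName);
        return actual.GetString(wrong.GetBytes(observed));
    }

    // Decodes the raw bytes directly, skipping the wrongly decoded string entirely.
    public static string DecodeRaw(byte[] raw, string actualName = "IBM01141")
    {
        return Encoding.GetEncoding(actualName).GetString(raw);
    }

    public static void Main()
    {
        // Needed on .NET Core / .NET 5+; can be dropped on the classic .NET Framework.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        // Route 1: repair strings that were already decoded with the wrong encoding.
        Console.WriteLine(Fix("M}nchen"));
        Console.WriteLine(Fix("beigef}gtem Zahlschein"));

        // Route 2: simulate the host's bytes, then decode them with the right encoding.
        byte[] raw = Encoding.GetEncoding("IBM01141").GetBytes("M\u00E4rz");
        Console.WriteLine(DecodeRaw(raw));
    }
}
```

Route 2 is generally the safer choice, since any character the wrong decoder dropped or replaced cannot be recovered by re-encoding.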
