Yet another code page detection question

Question

OK, before you jump at me with spears and take me away to the burning battlefield of code pages, please note that I am not trying to auto-detect the code page of a text. I know that's not possible. What I don't know is whether it is possible to automatically detect a code page problem. Take the following example: I have a largish text (2-3 pages) plus a "default" code page. I try to decode the text with the default code page. If I get gibberish, I try to decode the text with another code page. So the question is: is it possible to somehow detect gibberish characters?

Thanks for your kind help in advance. Best Regards, Daniel

Steve Morgan · Accepted Answer

I reckon that the only practical way is to manually define some kind of 'mask' for each code page: a structure that defines all of the character values that you consider valid for that code page.

Then, you could check if the page contained any character values that weren't contained in this mask.
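
A minimal sketch of that check, assuming the mask is simply a set of characters. The names `CodePageMask` and `FindUnexpected` are made up for illustration; they are not part of any library.

```csharp
using System;
using System.Collections.Generic;

static class CodePageMask
{
    // Returns the distinct characters of 'text' that are not in 'mask',
    // in order of first appearance. An empty result means the decoded
    // text contains only characters you consider valid.
    public static List<char> FindUnexpected(string text, ISet<char> mask)
    {
        var seen = new HashSet<char>();
        var unexpected = new List<char>();

        foreach (char c in text)
        {
            // seen.Add returns false for a character already reported.
            if (!mask.Contains(c) && seen.Add(c))
            {
                unexpected.Add(c);
            }
        }

        return unexpected;
    }
}
```

If the list comes back non-empty (or longer than some threshold you pick), treat the decode as suspect and try the next code page.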

Building the mask would involve a fair bit of manual effort. Create a page with every character, then display it using the appropriate code page and then look to see which aren't rendered 'nicely'. It's a one-off activity for each code page, so perhaps worth the effort.

Of course, if there was a way to parse a code page, you could generate this mask automatically... Hmm... Back in a bit.

Try this code fragment. It tests the characters 32-255 against each known code page.

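        using System;
        using System.Linq;
        using System.Text;

        // On .NET Core / .NET 5+, legacy code pages must be registered first:
        // Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        // Build a string holding the characters 32-255.
        StringBuilder source = new StringBuilder();

        for (int ix = 0; ix < 224; ix++)
        {
            source.Append((char)(ix + 32));
        }

        EncodingInfo[] encs = Encoding.GetEncodings();

        foreach (var encInfo in encs)
        {
            Console.WriteLine(encInfo.DisplayName);
            Encoding enc = Encoding.GetEncoding(encInfo.CodePage);

            // What '?' itself looks like in this encoding (more than one byte in UTF-16, for example).
            byte[] fallback = enc.GetBytes("?");

            for (int ix = 0; ix < 224; ix++)
            {
                // Encode one character at a time so that multi-byte encodings
                // cannot shift the byte positions out of step with the source.
                byte[] result = enc.GetBytes(source.ToString(ix, 1));

                if (source[ix] != '?' && result.SequenceEqual(fallback))
                {
                    // Code page translated character to '?'
                    Console.Write("{0} ", (int)source[ix]);
                }
            }
            Console.WriteLine();
        }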

I was looking around in the debugger and noticed that '?' is used as a fall-back character if the source character is not included in the code page. By checking for '?' (and ensuring that it wasn't '?' to start with), the code assumes that the code page couldn't handle it.
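
To make the fall-back behaviour concrete, here is a small sketch that asks the framework directly instead of looking for '?'. It assumes you are happy to create a strict copy of the encoding with `EncoderFallback.ExceptionFallback`, so an unmappable character raises `EncoderFallbackException` rather than being replaced. The names `FallbackDemo` and `IsUnmappable` are just illustrative.

```csharp
using System;
using System.Text;

static class FallbackDemo
{
    // True if 'enc' has no representation for 'c'.
    public static bool IsUnmappable(Encoding enc, char c)
    {
        // Same code page, but throw instead of substituting a replacement.
        Encoding strict = Encoding.GetEncoding(
            enc.CodePage,
            EncoderFallback.ExceptionFallback,
            DecoderFallback.ExceptionFallback);

        try
        {
            strict.GetBytes(new[] { c });
            return false;
        }
        catch (EncoderFallbackException)
        {
            return true;
        }
    }
}
```

For example, `Encoding.ASCII.GetBytes("é")` yields the single byte 63 ('?'), while `FallbackDemo.IsUnmappable(Encoding.ASCII, 'é')` returns true and a literal '?' returns false. As far as I know, this also catches cases where the .NET Framework would otherwise quietly "best fit" a character to a look-alike instead of '?', but I've not verified that for every code page.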

DBCS code pages may need a bit more attention, I've not looked. But try this as a starting point.

I'd use code like this to build an initial 'mask', as I described earlier, and then manually adjust that mask based on what looked good and what didn't.
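
Something along these lines could produce that initial mask. It is only a sketch: it assumes a character belongs in the mask when it survives an encode/decode round trip unchanged, which is a slightly stricter test than looking for '?'. `MaskBuilder` and `Build` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

static class MaskBuilder
{
    // Collects every character in [first, last] that 'enc' can encode
    // and decode back to the same character.
    public static HashSet<char> Build(Encoding enc, char first = ' ', char last = '\u00FF')
    {
        var mask = new HashSet<char>();

        for (int code = first; code <= last; code++)
        {
            string s = ((char)code).ToString();

            // A replaced character comes back as something else (usually '?'),
            // so it fails the comparison and stays out of the mask.
            if (enc.GetString(enc.GetBytes(s)) == s)
            {
                mask.Add((char)code);
            }
        }

        return mask;
    }
}
```

The manual adjustment is then just a matter of calling `Remove` (or `Add`) on the returned set, for instance dropping control characters that round-trip fine but that you would never expect in real text.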
