Reputation: 340
Let's say I have a file with this input:
"Crème donut, $1.00"
If a user uploads the file incorrectly encoded as ANSI and I parse it using TextFieldParser with a UTF-8 encoding configured to throw an exception on invalid bytes, it correctly throws an exception. It reports:
"Unable to translate bytes [E8] at index 321 from specified code page to Unicode."
The exception's "BytesUnknown" property contains a byte array with a single entry, [232]; 232 is the decimal equivalent of 0xE8. What's odd is that "è" should really be Byte[2] { 195, 168 }, I believe.
I would like to report back to the user what character caused the discrepancy.
What is the best way to do this?
If I return Encoding.UTF8.GetString(ex.BytesUnknown), it returns the Unicode replacement character instead of "è". Presumably this is because 232 as a single byte is not valid UTF-8.
What am I missing? It seems like I have all the information I need to be helpful to the user, but I'm unable to communicate it.
Upvotes: 1
Views: 2029
Reputation: 340
I see the issue. In my example I was using "è" as the non-ASCII character. It is the single byte \xE8 in ANSI (e.g. Windows-1252) but the two bytes \xC3\xA8 in UTF-8. If I try to decode \xE8 as UTF-8 (or as any other Unicode encoding, I believe), the decoder can't tell what I was asking for: a lone \xE8 byte is not a valid UTF-8 sequence, even though the code point for "è" happens to be U+00E8.
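To make the byte difference concrete, here is a minimal sketch that prints the encoded bytes of "è" in both encodings. The class and method names are hypothetical, and I'm using ISO-8859-1 as a stand-in for "ANSI" because it agrees with Windows-1252 for this character and is available without registering extra code page providers.

```csharp
using System;
using System.Text;

static class EncodingBytesDemo
{
    // Hypothetical helper: returns the bytes of `text` in `encoding` as hex, e.g. "C3-A8".
    public static string HexBytes(string text, Encoding encoding)
    {
        return BitConverter.ToString(encoding.GetBytes(text));
    }

    static void Main()
    {
        string eGrave = "\u00E8"; // "è"

        // UTF-8 needs two bytes for U+00E8.
        Console.WriteLine(HexBytes(eGrave, Encoding.UTF8));                       // C3-A8

        // ISO-8859-1 (assumed stand-in for the ANSI code page) needs one byte.
        Console.WriteLine(HexBytes(eGrave, Encoding.GetEncoding("iso-8859-1"))); // E8
    }
}
```

This is why BytesUnknown holds a single 232: the file really does contain one byte for "è", and that byte on its own means nothing to a UTF-8 decoder.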
I ended up using the following code, which works for my circumstances given the regional settings on my servers:
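    catch (DecoderFallbackException ex)
    {
        // Encoding.Default is the system ANSI code page on .NET Framework;
        // on .NET Core and later it is always UTF-8, so this relies on my server setup.
        var ansiEncoding = Encoding.Default;
        // Re-decode the bytes UTF-8 rejected, this time as ANSI.
        var ansiOutput = ansiEncoding.GetString(ex.BytesUnknown);
        throw new PageException("This file contains unexpected characters. The following character was found in the file: " + ansiOutput);
    }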
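If you can't rely on Encoding.Default, the same idea can be sketched with an explicitly chosen legacy encoding. This is only an illustration under assumptions: the class and method names are hypothetical, and ISO-8859-1 is my guess at what the uploader's editor used; if the file was saved in a different code page, the reported character will be wrong.

```csharp
using System;
using System.Text;

static class InvalidByteReporter
{
    // Strict UTF-8: throws DecoderFallbackException instead of substituting U+FFFD.
    static readonly Encoding StrictUtf8 =
        new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);

    // Assumption: the file was saved in a Latin-1-compatible code page.
    static readonly Encoding AssumedLegacy = Encoding.GetEncoding("iso-8859-1");

    // Hypothetical helper: returns null when `data` is valid UTF-8; otherwise
    // returns the first rejected bytes re-decoded with the assumed legacy encoding.
    public static string FindOffendingText(byte[] data)
    {
        try
        {
            StrictUtf8.GetString(data);
            return null;
        }
        catch (DecoderFallbackException ex)
        {
            return AssumedLegacy.GetString(ex.BytesUnknown);
        }
    }

    static void Main()
    {
        // "Crème" as it would be stored in ISO-8859-1: "è" is the single byte 0xE8.
        byte[] upload = { 0x43, 0x72, 0xE8, 0x6D, 0x65 };
        string offending = FindOffendingText(upload);
        Console.WriteLine(offending ?? "(valid UTF-8)");
    }
}
```

For the sample bytes this should report "è", though how it displays depends on the console's output encoding. Note that it only reports the first invalid sequence, since the decoder stops at the first exception.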
Upvotes: 3