Reputation: 306
I have some text that will be written to two files, one encoded as UTF-8 and the other as Windows-1252.
Observation when comparing these 2 files:
Question: Can I work out which character in the UTF-8 file will be represented by which character in the 1252 file without actually writing the files?
Or to put it another way: Is there more efficient code than this to find out the differences without writing to a text file?
File.WriteAllText("tmp-utf8.txt", text, Encoding.UTF8);
File.WriteAllText("tmp-cp1252.txt", text, Encoding.GetEncoding(1252));
string textUtf8 = File.ReadAllText("tmp-utf8.txt", Encoding.UTF8);
string text1252 = File.ReadAllText("tmp-cp1252.txt", Encoding.GetEncoding(1252));
if (textUtf8 != text1252)
{
    // ... do something
}
Finally I want to print out something like this:
"a"->"a"
"b"->"b"
"Ф"->"F"
"σ"->"s"
"ξ"->"?"
"ψ"->"?"
Upvotes: 0
Views: 52
Reputation: 71144
You can use Encoding.GetBytes to get the exact byte representation, and SequenceEqual to compare.
var bytesUtf8 = Encoding.UTF8.GetBytes(text);
var bytes1252 = Encoding.GetEncoding(1252).GetBytes(text);
if (!bytesUtf8.AsSpan().SequenceEqual(bytes1252))
{
    // do something
}
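Finding the exact index of each difference is harder, because UTF-8 encodes every non-ASCII character as a multi-byte sequence while 1252 always uses a single byte, so the byte offsets in the two arrays drift apart.
Maybe something like this, which encodes one char at a time: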
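Span<byte> bufferUtf8 = stackalloc byte[4];
Span<byte> buffer1252 = stackalloc byte[4];
var cp1252 = Encoding.GetEncoding(1252);
for (var i = 0; i < text.Length; i++)
{
    // Encode a single UTF-16 char with each encoding (TryGetBytes requires .NET 8 or later).
    // Note: half of a surrogate pair is not a complete character and will be replaced by the encoder.
    var ch = text.AsSpan(i, 1);
    Encoding.UTF8.TryGetBytes(ch, bufferUtf8, out var lengthUtf8);
    cp1252.TryGetBytes(ch, buffer1252, out var length1252);

    // Compare only the bytes actually written, not stale bytes left over from earlier iterations.
    var utf8Bytes = bufferUtf8.Slice(0, lengthUtf8);
    var cp1252Bytes = buffer1252.Slice(0, length1252);
    if (!utf8Bytes.SequenceEqual(cp1252Bytes))
    {
        Console.WriteLine($"Index {i} does not match: 0x{Convert.ToHexString(utf8Bytes)} -> 0x{Convert.ToHexString(cp1252Bytes)}");
    }
}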
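If the goal is the character-to-character listing from the question rather than a byte comparison, one option is to round-trip each character through the 1252 encoding in memory and print what comes back. The following is only a sketch: Cp1252Mapping and Map are made-up names, and it assumes .NET 5 or later, where code page 1252 has to be enabled via CodePagesEncodingProvider. What an unmappable character turns into ("?" or a best-fit letter) depends on the encoding's fallback settings, so I would not rely on specific substitutions without checking them on the target runtime.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper, not part of any library.
public static class Cp1252Mapping
{
    // Returns (original, mapped) pairs, one per Unicode code point in the text.
    public static List<(string Original, string Mapped)> Map(string text)
    {
        // Assumption: running on .NET Core / .NET 5+, where code page
        // encodings are only available after registering this provider.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding cp1252 = Encoding.GetEncoding(1252);

        var pairs = new List<(string Original, string Mapped)>();

        // EnumerateRunes keeps surrogate pairs together as one code point.
        foreach (Rune rune in text.EnumerateRunes())
        {
            string original = rune.ToString();

            // Encode to 1252 bytes and decode again: the result is what a
            // reader of the 1252 file would see for this character.
            string mapped = cp1252.GetString(cp1252.GetBytes(original));

            pairs.Add((original, mapped));
        }

        return pairs;
    }
}
```

Looping over the result and printing only the pairs where Original differs from Mapped gives a list in the "x"->"y" shape shown in the question, without touching the file system.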
Upvotes: 2