CSharper

Reputation: 306

Calculate difference in encoding WITHOUT actually writing to a file?

I have some text that will be written to two files, one encoded as UTF-8 and the other as Windows-1252.

Observation when comparing these 2 files:

Question: Can I calculate which character in the UTF8 file will be represented by what character in the 1252 file without actually writing the files?

Or to put it another way: Is there more efficient code than this to find out the differences without writing to a text file?

File.WriteAllText("tmp-utf8.txt", text, Encoding.UTF8);
File.WriteAllText("tmp-cp1252.txt", text, Encoding.GetEncoding(1252));

string textUtf8 = File.ReadAllText("tmp-utf8.txt", Encoding.UTF8);
string text1252 = File.ReadAllText("tmp-cp1252.txt", Encoding.GetEncoding(1252));

if (textUtf8 != text1252)
{
    // ... do something
}

Finally, I want to print out something like this:

"a"->"a"
"b"->"b"
"Ф"->"F"
"σ"->"s"
"ξ"->"?"
"ψ"->"?" 

Upvotes: 0

Views: 52

Answers (1)

Charlieface

Reputation: 71144

You can use Encoding.GetBytes to get the exact byte representation, and SequenceEqual to compare.

var bytesUtf8 = Encoding.UTF8.GetBytes(text);
var bytes1252 = Encoding.GetEncoding(1252).GetBytes(text);
if (!bytesUtf8.AsSpan().SequenceEqual(bytes1252))
{
    // do something
}

Finding the exact index of each difference is harder, because UTF-8 encodes some characters as multi-byte sequences, so byte offsets no longer line up with character positions.
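To illustrate the multi-byte point, here is a minimal sketch. The class `Utf8Offsets` and its methods are hypothetical names introduced for this example; only `Encoding.UTF8.GetByteCount` is a real API.

```csharp
using System;
using System.Text;

public static class Utf8Offsets
{
    // Hypothetical helper: number of bytes UTF-8 needs for the whole string.
    public static int ByteCount(string s) => Encoding.UTF8.GetByteCount(s);

    // Hypothetical helper: byte offset in the UTF-8 output at which the
    // character at charIndex starts (bytes needed for everything before it).
    public static int ByteOffset(string text, int charIndex) =>
        Encoding.UTF8.GetByteCount(text.AsSpan(0, charIndex));
}
```

For example, `"σ"` takes two bytes in UTF-8, so in `"aσb"` the character at index 2 starts at byte offset 3 rather than 2. A single-byte code page such as 1252 keeps the two in step, which is why a plain byte-by-byte comparison drifts after the first non-ASCII character.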

A per-character comparison might look something like this:

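// Note: on .NET Core / .NET 5+, code page 1252 is only available after
// Encoding.RegisterProvider(CodePagesEncodingProvider.Instance).
var encoding1252 = Encoding.GetEncoding(1252);
Span<byte> bufferUtf8 = stackalloc byte[4];
Span<byte> buffer1252 = stackalloc byte[4];

for (var i = 0; i < text.Length; i++)
{
    // Encode a surrogate pair as one unit; otherwise take a single char.
    var count = char.IsSurrogatePair(text, i) ? 2 : 1;
    var chars = text.AsSpan(i, count);
    var utf8 = bufferUtf8[..Encoding.UTF8.GetBytes(chars, bufferUtf8)];
    var cp1252 = buffer1252[..encoding1252.GetBytes(chars, buffer1252)];

    // Compare only the bytes actually written, not the whole 4-byte buffers.
    if (!utf8.SequenceEqual(cp1252))
    {
        Console.WriteLine($"Index {i} does not match: 0x{Convert.ToHexString(utf8)} -> 0x{Convert.ToHexString(cp1252)}");
    }
    i += count - 1; // skip the low surrogate that was already consumed
}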
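If the goal is the character mapping from the question ("x"->"y") rather than the raw bytes, one option is to round-trip each character through the target encoding in memory. This is only a sketch under assumptions: `EncodingDiff` and `RoundTrip` are hypothetical names, and it relies on `string.EnumerateRunes`, which needs .NET Core 3.0 or later.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class EncodingDiff
{
    // Hypothetical helper: for each Unicode scalar value in the text, encode it
    // with the target encoding, decode it again, and return both strings.
    public static List<(string Original, string RoundTripped)> RoundTrip(
        string text, Encoding target)
    {
        var result = new List<(string Original, string RoundTripped)>();
        foreach (Rune rune in text.EnumerateRunes())
        {
            string original = rune.ToString();
            // GetBytes applies the encoding's fallback to unmappable characters;
            // GetString shows what a reader of the file would see.
            string roundTripped = target.GetString(target.GetBytes(original));
            result.Add((original, roundTripped));
        }
        return result;
    }
}
```

Calling `EncodingDiff.RoundTrip(text, Encoding.GetEncoding(1252))` and printing each pair as `$"\"{p.Original}\"->\"{p.RoundTripped}\""` should give output in the shape asked for, with no temporary files. Whether an unmappable character comes back as a best-fit letter or as "?" depends on the fallback configured on the target encoding, so the exact result may differ between runtimes.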

Upvotes: 2
