Statick

Reputation: 33

C# utf string conversion, characters which don't display correctly get converted to "unknown character" - how to prevent this?

I've got two strings derived from Windows filenames. They contain Unicode characters that do not display correctly in Windows (they show as the square-box "unknown character" instead of the correct character). However, the filenames are valid and the files exist without problems in the operating system, so I need to be able to handle them correctly and accurately.

I'm loading the filenames the usual way:

string path = @"c:\folder";
foreach (FileInfo file in new DirectoryInfo(path).EnumerateFiles())
{
    string filename = file.FullName;
}

but for the purposes of explaining this problem, these are the two filenames I'm having issues with:

string filename1 = "\ude18.txt";
string filename2 = "\udca6.txt";

Two strings, two filenames, each a single Unicode character plus an extension, and the two are different. So far this is fine: I can read and write these files without a problem. However, I need to store these strings in a SQLite db and retrieve them later. Every attempt to do so results in both characters being changed to the "unknown character", so the original data is lost and I can no longer tell the two strings apart. At first I thought this was a SQLite issue, and I've made sure my db is in UTF-16, but it turns out that the conversion to UTF-16 in C# is causing the problem.

If I ignore SQLite entirely and simply try to convert these strings to UTF-16 manually (or to any other encoding), these characters are converted to the "unknown character" and the original data is lost. If I do this:

System.Text.Encoding enc = System.Text.Encoding.Unicode;
string filename1 = "\ude18.txt";
string filename2 = "\udca6.txt";
byte[] name1Bytes = enc.GetBytes(filename1);
byte[] name2Bytes = enc.GetBytes(filename2);

and then inspect the byte arrays 'name1Bytes' and 'name2Bytes', they are identical. In both cases the Unicode character has been converted to the byte pair 253, 255 (U+FFFD, the "unknown character", in little-endian order). Sure enough, when I convert back

string newFilename1 = enc.GetString(name1Bytes);
string newFilename2 = enc.GetString(name2Bytes);

the original Unicode character is lost in each case and replaced with a diamond question mark symbol. I have lost the original filenames altogether.

It seems that these encoding conversions rely on the system font being able to display the characters. This is a problem because these strings already exist as filenames, and changing the filenames isn't an option. I need to preserve this data when sending it to SQLite, where it will go through a conversion to UTF-16, and it is this conversion that the data needs to survive intact.

Upvotes: 2

Views: 538

Answers (1)

Petter Hesselberg

Reputation: 5498

If you cast a char to an int, you get the numeric value, bypassing the Unicode conversion mechanism:

foreach (char ch in filename1)
{
    int i = ch; // 0x0000de18 == 56856 for the first char in filename1
    // ... do whatever, e.g., create an int array, store it as base64
}

This turns out to work as well, and is perhaps more elegant:

foreach (int ch in filename1)
{
    // ...
}

So perhaps something like this:

// Stores each UTF-16 code unit as two bytes (low byte first), then base64-encodes the result.
string Encode(string raw)
{
    byte[] bytes = new byte[2 * raw.Length];
    int i = 0;
    foreach (int ch in raw)
    {
        bytes[i++] = (byte)(ch & 0xff); // low byte
        bytes[i++] = (byte)(ch >> 8);   // high byte
    }

    return Convert.ToBase64String(bytes);
}

// Reverses Encode: rebuilds each char from its two bytes.
string Decode(string encoded)
{
    byte[] bytes = Convert.FromBase64String(encoded);
    char[] chars = new char[bytes.Length / 2];
    for (int i = 0; i < chars.Length; ++i)
    {
        chars[i] = (char)(bytes[i * 2] | (bytes[i * 2 + 1] << 8));
    }

    return new string(chars);
}
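
As a rough check that this survives the round trip, here is a self-contained sketch of the same idea. The class name `RawStringCodec` and the helper `RoundTrips` are made up for illustration; the byte layout is the one used above. For context: both `\ude18` and `\udca6` are unpaired low surrogates, and as far as I know `Encoding.Unicode` by default substitutes U+FFFD for invalid surrogates, which would explain why the two names collapse into one. Copying the code units by hand involves no such validation.

```csharp
using System;

// Hypothetical helper class, not part of any library.
static class RawStringCodec
{
    // Copies every UTF-16 code unit as two bytes (low byte first),
    // without checking whether surrogates are paired.
    public static string Encode(string raw)
    {
        byte[] bytes = new byte[2 * raw.Length];
        for (int i = 0; i < raw.Length; ++i)
        {
            bytes[2 * i] = (byte)(raw[i] & 0xff);
            bytes[2 * i + 1] = (byte)(raw[i] >> 8);
        }
        return Convert.ToBase64String(bytes);
    }

    // Rebuilds the original code units from the base64 text.
    public static string Decode(string encoded)
    {
        byte[] bytes = Convert.FromBase64String(encoded);
        char[] chars = new char[bytes.Length / 2];
        for (int i = 0; i < chars.Length; ++i)
        {
            chars[i] = (char)(bytes[2 * i] | (bytes[2 * i + 1] << 8));
        }
        return new string(chars);
    }

    // True if the string comes back unchanged, compared code unit by code unit.
    public static bool RoundTrips(string raw)
    {
        return string.Equals(Decode(Encode(raw)), raw, StringComparison.Ordinal);
    }
}
```

With this, `RawStringCodec.RoundTrips("\ude18.txt")` and `RawStringCodec.RoundTrips("\udca6.txt")` should both hold, and the two encoded strings differ, so the filenames stay distinguishable. The base64 text is plain ASCII, so it should pass through the conversion on the way into the db unchanged.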

Upvotes: 1
