Lucian Bumb

Reputation: 2881

How to repair a string that had diacritics and was converted to strange characters?

I need to import data from an old database, and in the process I want to repair some strings which look like this:

Example 1: the existing string is "GraÅ£iela", which was originally "Graţiela", and I want to save it as "Gratiela".

Example 2: the existing string is "MÄ‚DÄ‚LINA", which was originally "Mădălina", and I want to save it as "Madalina".

I am able to remove diacritics, but some strings, like examples 1 and 2, contain strange characters caused by a bad conversion.

My question is: do you know any way to repair this kind of string (other than manually)?

I have more than 50K rows with plenty of words like those in the examples above.

I tried the following:

var text = "GraÅ£iela";
Console.WriteLine(text.Normalize()); // --> GraÅ£iela
Console.WriteLine(Encoding.UTF8.GetString(Encoding.UTF8.GetBytes(text))); // --> GraÅ£iela
Console.WriteLine(Encoding.ASCII.GetString(Encoding.ASCII.GetBytes(text))); // --> Gra??iela
Console.WriteLine(Encoding.UTF7.GetString(Encoding.UTF7.GetBytes(text))); // --> GraÅ£iela
Console.WriteLine(Encoding.UTF32.GetString(Encoding.UTF32.GetBytes(text))); // --> GraÅ£iela
Console.WriteLine(Encoding.Unicode.GetString(Encoding.Unicode.GetBytes(text))); // --> GraÅ£iela
Console.WriteLine(Encoding.BigEndianUnicode.GetString(Encoding.BigEndianUnicode.GetBytes(text))); // --> GraÅ£iela
Console.WriteLine(Encoding.Default.GetString(Encoding.Default.GetBytes(text))); // --> GraÅ£iela

None of these fixes my issue. Do you have any other ideas, or is something wrong with my approach?

Upvotes: 4

Views: 68

Answers (2)

Thomas Levesque

Reputation: 292415

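Your examples look like UTF-8 strings that were decoded with a single-byte encoding such as ISO-8859-1 or Windows-1252 (on .NET Framework, Encoding.Default is the system ANSI code page; on .NET Core and later it is UTF-8, so pass the single-byte encoding explicitly there). To recover the original strings, re-encode them with that single-byte encoding and decode the resulting bytes as UTF-8: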

string FixEncoding(string badString, Encoding bad, Encoding good)
{
    var bytes = bad.GetBytes(badString);
    return good.GetString(bytes);
}

...

string fixedString = FixEncoding("GraÅ£iela", Encoding.Default, Encoding.UTF8); // Graţiela

Note that it will work only if no information was lost when the string was decoded using the wrong encoding. The safest way is to always read the string with the correct encoding; if the database contains the correct strings, make sure you're using the same encoding as the database for reading them.
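One way to detect the "information was lost" case is to use encodings that throw instead of silently substituting characters. The following is only a sketch under some assumptions: the wrong decoding is taken to be Windows-1252 (code page 1252), which fits both examples in the question but is not confirmed for the real database; it assumes .NET 5 or later with top-level statements; and TryRepair is a hypothetical helper name, not part of any library.

```csharp
using System;
using System.Text;

// On .NET Core / .NET 5+ code page 1252 is only available after registering
// the code pages provider.
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

// Strict variants: throw instead of substituting '?' or U+FFFD.
// Assumption: the strings were wrongly decoded as Windows-1252.
Encoding cp1252 = Encoding.GetEncoding(
    1252, EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback);
Encoding strictUtf8 = new UTF8Encoding(
    encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);

// Hypothetical helper: returns the repaired string, or the input unchanged
// when it cannot be reversed cleanly.
string TryRepair(string s)
{
    try
    {
        byte[] bytes = cp1252.GetBytes(s);   // throws if a char has no byte in cp1252
        return strictUtf8.GetString(bytes);  // throws if the bytes are not valid UTF-8
    }
    catch (EncoderFallbackException) { return s; }
    catch (DecoderFallbackException) { return s; }
}

Console.OutputEncoding = Encoding.UTF8;
Console.WriteLine(TryRepair("GraÅ£iela"));  // returns "Graţiela"
Console.WriteLine(TryRepair("MÄ‚DÄ‚LINA"));  // returns "MĂDĂLINA"
Console.WriteLine(TryRepair("Gratiela"));   // plain ASCII, returned unchanged
```

Leaving the input unchanged on failure means rows that were already correct are not damaged by the repair pass. For example, a correctly stored "café" fails the strict UTF-8 decode and is returned as-is. It is still worth spot-checking the output, since a correct string can in rare cases happen to form valid UTF-8 bytes.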

To remove the diacritics, you can use this:

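// Requires: using System.Globalization; using System.Text;
string RemoveDiacritics(string s)
{
    // FormD splits each accented letter into a base letter plus combining marks
    var decomposed = s.Normalize(NormalizationForm.FormD);
    var sb = new StringBuilder();
    for (int i = 0; i < decomposed.Length; i++)
    {
        // Keep everything except the combining marks (the diacritics)
        var category = CharUnicodeInfo.GetUnicodeCategory(decomposed, i);
        if (category != UnicodeCategory.NonSpacingMark)
            sb.Append(decomposed[i]);
    }
    // Recompose whatever is left into the usual precomposed form
    return sb.ToString().Normalize(NormalizationForm.FormC);
}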
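For the import described in the question, the two steps can be chained: repair the encoding first, then strip the diacritics. This is a minimal sketch, assuming .NET 5 or later with top-level statements and assuming the wrong decoding was Windows-1252 (an assumption based on the two sample strings, not something confirmed for the real data). The rows array stands in for the database rows.

```csharp
using System;
using System.Globalization;
using System.Text;

// Make code page 1252 available on .NET Core / .NET 5+.
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
Encoding cp1252 = Encoding.GetEncoding(1252); // assumption: the wrong decoding

// Same idea as FixEncoding in this answer: undo the wrong decoding.
string FixEncoding(string badString, Encoding bad, Encoding good)
    => good.GetString(bad.GetBytes(badString));

// Same idea as RemoveDiacritics in this answer: drop combining marks.
string RemoveDiacritics(string s)
{
    var sb = new StringBuilder();
    foreach (char c in s.Normalize(NormalizationForm.FormD))
    {
        if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
            sb.Append(c);
    }
    return sb.ToString().Normalize(NormalizationForm.FormC);
}

// Placeholder for the rows read from the old database.
string[] rows = { "GraÅ£iela", "MÄ‚DÄ‚LINA" };

Console.OutputEncoding = Encoding.UTF8;
foreach (string row in rows)
{
    string repaired = FixEncoding(row, cp1252, Encoding.UTF8); // "Graţiela", "MĂDĂLINA"
    string plain = RemoveDiacritics(repaired);                 // "Gratiela", "MADALINA"
    Console.WriteLine($"{row} -> {repaired} -> {plain}");
}
```

The order matters: stripping diacritics before the repair would leave the stray characters in place, because they are ordinary letters and symbols rather than combining marks. Letter casing is left untouched, so the second sample stays upper case.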

Upvotes: 3

CRefice

Reputation: 430

You should look into using String.Normalize(). If that doesn't work, try converting the strings to a byte array and then decoding that array as UTF-8 (for example, using System.Text.Encoding.UTF8.GetString(byteArray)).

Upvotes: 1
