Trishen

Reputation: 237

Replacing characters in C# (ascii)

I have a file with characters like these: à, è, ì, ò, ù, À. I need to replace those characters with plain ASCII characters, e.g. à = a, è = e, and so on. This is my code so far:

StreamWriter sw = new StreamWriter(@"C:/JoinerOutput.csv");
string path = @"C:/Joiner.csv";
string line = File.ReadAllText(path);

if (line.Contains("à"))
{
    string asAscii = Encoding.ASCII.GetString(Encoding.Convert(Encoding.UTF8, Encoding.GetEncoding(Encoding.ASCII.EncodingName, new EncoderReplacementFallback("a"), new DecoderExceptionFallback()), Encoding.UTF8.GetBytes(line)));
    Console.WriteLine(asAscii);
    Console.ReadLine();

    sw.WriteLine(asAscii);
    sw.Flush();
}

Basically, this searches the file for a specific character and replaces it with another. The problem I am having is that my if statement doesn't work. How do I go about solving this?

This is a sample of the input file:

Dimàkàtso Mokgàlo
Màmà Ràtlàdi
Koos Nèl
Pàsèkà Modisè
Jèrèmiàh Morèmi
Khèthiwè Buthèlèzi
Tiànà Pillày
Viviàn Màswàngànyè
Thirèshàn Rèddy
Wàdè Cornèlius
ènos Nètshimbupfè

This is the output if I use line = line.Replace('à', 'a'); instead:

Ch�rl�n� Kirst�n
M�m� R�tl�di
Koos N�l
P�s�k� Modis�
J�r�mi�h Mor�mi
Kh�thiw� Buth�l�zi
Ti�n� Pill�y
Vivi�n M�sw�ng�ny�
Thir�sh�n R�ddy
W�d� Corn�lius
�nos N�tshimbupf�

With my code, the symbol is removed completely.

Upvotes: 10

Views: 34634

Answers (7)

realbart

Reputation: 3974

I often use an extension method based on the version Dana supplied. A quick explanation:

  • Normalizing to form D splits characters like è into an e and a nonspacing `
  • From this, the nonspacing characters are removed
  • The result is normalized back to form C (I'm not sure if this is necessary)

Code:

using System.Linq;
using System.Text;
using System.Globalization;

// namespace here
public static class Utility
{
    public static string RemoveDiacritics(this string str)
    {
        if (str == null) return null;
        var chars =
            from c in str.Normalize(NormalizationForm.FormD).ToCharArray()
            let uc = CharUnicodeInfo.GetUnicodeCategory(c)
            where uc != UnicodeCategory.NonSpacingMark
            select c;

        var cleanStr = new string(chars.ToArray()).Normalize(NormalizationForm.FormC);
         
        return cleanStr;
    }
}

edit

Like the name says, this just removes diacritics. This may not be what you want:

  • In some languages, it is common to latinize characters with diacritics by replacing them with a letter combination. In German, for example, ü is replaced by ue.
  • This just removes diacritics as defined by Unicode. ö is seen as a combination of o and ̈ , but ø is not seen as a combination of o and /. Same thing for ł.
  • Combined characters like œ and æ are also left alone.
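
If those cases matter to you, one option is to combine the normalization approach with a small hand-written table. The following is only a sketch: the class name `LatinizeSketch`, the method name `Latinize` and the contents of the table are my own illustrative choices, not part of any library, and the table is deliberately incomplete. Language-specific rules such as German ü = ue are not covered and would need their own entries, applied before normalization.

```csharp
using System.Collections.Generic;
using System.Globalization;
using System.Text;

public static class LatinizeSketch
{
    // Hypothetical, incomplete table for letters that Unicode does not
    // decompose into a base letter plus a combining mark.
    private static readonly Dictionary<char, string> Special = new Dictionary<char, string>
    {
        { 'ø', "o" },  { 'Ø', "O" },
        { 'ł', "l" },  { 'Ł', "L" },
        { 'œ', "oe" }, { 'Œ', "OE" },
        { 'æ', "ae" }, { 'Æ', "AE" },
        { 'ß', "ss" }
    };

    public static string Latinize(string input)
    {
        if (input == null) return null;

        var sb = new StringBuilder(input.Length);

        // Form D splits è into e + combining grave accent; the letters in the
        // table above pass through normalization unchanged.
        foreach (char c in input.Normalize(NormalizationForm.FormD))
        {
            string replacement;
            if (Special.TryGetValue(c, out replacement))
            {
                // Letter without a Unicode decomposition: use the table.
                sb.Append(replacement);
            }
            else if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
            {
                // Keep everything except the combining marks.
                sb.Append(c);
            }
        }

        return sb.ToString().Normalize(NormalizationForm.FormC);
    }
}
```

With this table, `LatinizeSketch.Latinize("Søren Łukasz Nèl")` gives `Soren Lukasz Nel`.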

Upvotes: 8

dana

Reputation: 18145

Others have commented on using a Unicode lookup table to remove diacritics. I did a quick Google search and found this example. Code shamelessly copied, re-formatted, and posted below:

using System;
using System.Text;
using System.Globalization;

public static class Remove
{
    public static string RemoveDiacritics(string stIn)
    {
        string stFormD = stIn.Normalize(NormalizationForm.FormD);
        StringBuilder sb = new StringBuilder();

        for(int ich = 0; ich < stFormD.Length; ich++) {
            UnicodeCategory uc = CharUnicodeInfo.GetUnicodeCategory(stFormD[ich]);
            if(uc != UnicodeCategory.NonSpacingMark) {
                sb.Append(stFormD[ich]);
            }
        }

        return(sb.ToString().Normalize(NormalizationForm.FormC));
    }
}

So, your code could clean the input by calling:

line = Remove.RemoveDiacritics(line);

Upvotes: 30

Ton Snoei

Reputation: 3195

Doing it the easy way: the code below will replace all special characters with ASCII characters in just 2 lines of code. It gives you the same result as Julien Roncaglia's solution.

byte[] bytes = System.Text.Encoding.GetEncoding("Cyrillic").GetBytes(inputText);
string outputText = System.Text.Encoding.ASCII.GetString(bytes);

Upvotes: 3

Julien Roncaglia

Reputation: 17837

I don't know if this is useful, but in an internal tool that writes messages to an LED screen we use the following replacements (I'm sure there are smarter ways to do this with the Unicode tables, but this is enough for a small internal tool):

        strMessage = Regex.Replace(strMessage, "[éèëêð]", "e");
        strMessage = Regex.Replace(strMessage, "[ÉÈËÊ]", "E");
        strMessage = Regex.Replace(strMessage, "[àâä]", "a");
        strMessage = Regex.Replace(strMessage, "[ÀÁÂÃÄÅ]", "A");
        strMessage = Regex.Replace(strMessage, "[àáâãäå]", "a");
        strMessage = Regex.Replace(strMessage, "[ÙÚÛÜ]", "U");
        strMessage = Regex.Replace(strMessage, "[ùúûüµ]", "u");
        strMessage = Regex.Replace(strMessage, "[òóôõöø]", "o");
        strMessage = Regex.Replace(strMessage, "[ÒÓÔÕÖØ]", "O");
        strMessage = Regex.Replace(strMessage, "[ìíîï]", "i");
        strMessage = Regex.Replace(strMessage, "[ÌÍÎÏ]", "I");
        strMessage = Regex.Replace(strMessage, "[š]", "s");
        strMessage = Regex.Replace(strMessage, "[Š]", "S");
        strMessage = Regex.Replace(strMessage, "[ñ]", "n");
        strMessage = Regex.Replace(strMessage, "[Ñ]", "N");
        strMessage = Regex.Replace(strMessage, "[ç]", "c");
        strMessage = Regex.Replace(strMessage, "[Ç]", "C");
        strMessage = Regex.Replace(strMessage, "[ÿ]", "y");
        strMessage = Regex.Replace(strMessage, "[Ÿ]", "Y");
        strMessage = Regex.Replace(strMessage, "[ž]", "z");
        strMessage = Regex.Replace(strMessage, "[Ž]", "Z");
        strMessage = Regex.Replace(strMessage, "[Ð]", "D");
        strMessage = Regex.Replace(strMessage, "[œ]", "oe");
        strMessage = Regex.Replace(strMessage, "[Œ]", "Oe");
        strMessage = Regex.Replace(strMessage, "[«»\u201C\u201D\u201E\u201F\u2033\u2036]", "\"");
        strMessage = Regex.Replace(strMessage, "[\u2026]", "...");

One thing to note: although in most languages the text is still understandable after this treatment, that is not always the case, and the reader will often have to rely on the context of the sentence to understand it. That is not something you want if you have the choice.

Note that the correct solution would be to use the Unicode tables: replace characters that have integrated diacritics with their "combining diacritical mark(s)" + character form, and then remove the diacritics.

Upvotes: 11

Iain Collins

Reputation: 6884

Sounds like what you want to do is convert Extended ASCII (eight-bit) to ASCII (seven-bit) - so searching for that might help.

I've seen libraries that handle this in other languages but have never had to do it in C#. This looks like it might be somewhat enlightening, though:

Convert two ascii characters to their 'corresponding' one character extended ascii representation

Upvotes: 0

CloudyMarble

Reputation: 37566

Use this:

     if (line.Contains("OldChar"))
     {
        line = line.Replace("OldChar", "NewChar");
     }

Upvotes: 0

Jon

Reputation: 437434

Why are you making things complicated?

line = line.Replace('à', 'a');

Update:

The docs for File.ReadAllText say:

This method attempts to automatically detect the encoding of a file based on the presence of byte order marks. Encoding formats UTF-8 and UTF-32 (both big-endian and little-endian) can be detected.

Use the ReadAllText(String, Encoding) method overload when reading files that might contain imported text, because unrecognized characters may not be read correctly.

What encoding is C:/Joiner.csv in? Maybe you should use the other overload for File.ReadAllText where you specify the input encoding yourself?
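
For what it's worth, a file without a byte order mark is read as UTF-8 by default. If the file was actually saved in a single-byte encoding, bytes such as 0xE0 (à in ISO-8859-1) are not valid UTF-8 and get decoded as the replacement character �, so neither Contains("à") nor Replace('à', 'a') ever sees an à. Here is a minimal sketch of the explicit overload. The class and method names are made up for illustration, and ISO-8859-1 in the usage line is only a guess at the real encoding of your file:

```csharp
using System.IO;
using System.Text;

public static class JoinerSketch
{
    // Reads the input with an explicitly chosen encoding, replaces two of the
    // accented letters from the question, and writes the result as UTF-8.
    public static void CleanFile(string inputPath, string outputPath, Encoding inputEncoding)
    {
        string text = File.ReadAllText(inputPath, inputEncoding);
        text = text.Replace('à', 'a').Replace('è', 'e');
        File.WriteAllText(outputPath, text, Encoding.UTF8);
    }
}
```

Usage, assuming the guess about the encoding is right:

```csharp
JoinerSketch.CleanFile(@"C:/Joiner.csv", @"C:/JoinerOutput.csv", Encoding.GetEncoding("iso-8859-1"));
```

If the output still contains �, the file is in some other encoding and you need to pass that one instead.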

Upvotes: 3
