Colonel Panic

Reputation: 105

Handling non-English characters in C#

I need to get my understanding of character sets and encodings right. Can someone point me to a good write-up on handling different character sets in C#?

Here's one of the problems I'm facing -

        using (StreamReader reader = new StreamReader("input.txt"))
        using (StreamWriter writer = new StreamWriter("output.txt"))
        {
            while (!reader.EndOfStream)
            {
                writer.WriteLine(reader.ReadLine());
            }
        }

This simple code snippet does not always preserve the encoding -

For example -

Aukéna in the input is turned into Auk�na in the output.

Upvotes: 3

Views: 3825

Answers (4)

SamuelDavis

Reputation: 3334

You could always create your own parser. What I use is:

    var ANSI = (Encoding) Encoding.GetEncoding(1252).Clone();
    ANSI.EncoderFallback = new EncoderReplacementFallback(string.Empty);

The first line creates a clone of the Windows-1252 encoding (the database I deal with uses Windows-1252; you'd probably want UTF-8 or ASCII instead). The clone is needed because the instance returned by GetEncoding is read-only. The second line sets the encoder fallback, so that any character with no equivalent in the target encoding is encoded as an empty string, i.e. dropped.

After this you'd preferably filter out all control characters (excluding tabs, spaces, line feeds and carriage returns, depending on what you need).
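To see what the empty-string fallback does in practice, here is a minimal sketch. It assumes ISO-8859-1 as a stand-in for Windows-1252, only because ISO-8859-1 is available on every .NET runtime without extra setup; the class and method names are made up for illustration.

```csharp
using System.Text;

static class LossyEncodingDemo
{
    // Hypothetical helper: a writable copy of ISO-8859-1 whose encoder
    // silently drops any character it cannot represent.
    public static Encoding CreateLossyLatin1()
    {
        // Encodings returned by GetEncoding are read-only, hence the Clone().
        var encoding = (Encoding)Encoding.GetEncoding("iso-8859-1").Clone();
        encoding.EncoderFallback = new EncoderReplacementFallback(string.Empty);
        return encoding;
    }

    // Encodes the text and decodes it again, so the result shows
    // exactly which characters survived the encoding.
    public static string RoundTrip(string text)
    {
        var encoding = CreateLossyLatin1();
        byte[] bytes = encoding.GetBytes(text);
        return encoding.GetString(bytes);
    }
}
```

With this sketch, `LossyEncodingDemo.RoundTrip("Aukéna 日本")` keeps `Aukéna ` (é exists in ISO-8859-1) and drops the two CJK characters, since they have no single-byte equivalent.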

Below is my personal encoding-parser which I set up to correct data entering our database.

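private string RetainOnlyPrintableCharacters(char c)
{
    // Even if the character comes from a different code page altogether,
    // if the character exists in 1252 it will be returned in 1252 format.
    // A character with no 1252 equivalent encodes to zero bytes (empty fallback).
    var ansiBytes = _ansiEncoding.GetBytes(new char[] {c});

    if (ansiBytes.Any())
    {
        // In() is not a framework method; presumably a custom extension
        // that checks membership in _printableCharacters.
        if (ansiBytes.First().In(_printableCharacters))
        {
            return _ansiEncoding.GetString(ansiBytes);
        }
    }
    return string.Empty;
}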

_ansiEncoding is the cloned encoding from above (the ANSI variable), with the fallback value set.

If ansiBytes is not empty, an encoding exists for the character that was passed in. Its byte is then compared against a list of all printable characters, and if it is found there, the character is acceptable and is returned.

Upvotes: 0

Esteban Araya

Reputation: 29664

You just have an encoding problem. You have to remember that all you're really reading is a stream of bits. You have to tell your program how to properly interpret those bits.
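A small sketch of that point: the same six bytes decoded under two different encodings. The byte values are an assumption about how "Aukéna" is stored in a single-byte encoding such as ISO-8859-1 or Windows-1252, where é is the single byte 0xE9; the class name is hypothetical.

```csharp
using System.Text;

static class DecodingDemo
{
    // "Aukéna" as a single-byte encoding stores it: é is the byte 0xE9.
    public static readonly byte[] Raw = { 0x41, 0x75, 0x6B, 0xE9, 0x6E, 0x61 };

    // 0xE9 followed by 'n' is not a valid UTF-8 sequence, so the decoder
    // substitutes U+FFFD (the replacement character) for it.
    public static string AsUtf8()
    {
        return Encoding.UTF8.GetString(Raw);
    }

    // Decoded with the encoding the bytes were written in, é comes back intact.
    public static string AsLatin1()
    {
        return Encoding.GetEncoding("iso-8859-1").GetString(Raw);
    }
}
```

`DecodingDemo.AsUtf8()` yields `Auk\uFFFDna`, which matches the symptom in the question, while `DecodingDemo.AsLatin1()` yields `Aukéna`.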

To fix your problem, just use the constructors that take an encoding as well, and set it to whatever encoding your text uses.

http://msdn.microsoft.com/en-us/library/ms143456.aspx

http://msdn.microsoft.com/en-us/library/3aadshsx.aspx

Upvotes: 5

VoidStar

Reputation: 571

StreamReader.ReadLine() attempts to read the file as UTF-8 when no encoding is passed to the StreamReader constructor. If that's not the encoding your file uses, StreamReader will not read the characters correctly.

This article details the problem and suggests passing System.Text.Encoding.Default to the constructor.

Upvotes: 2

horgh

Reputation: 18553

I guess when reading a file, you should know which encoding the file has. Otherwise you can easily fail to read it correctly.

When you know the encoding of a file, you may do the following:

        using (StreamReader reader = new StreamReader("input.txt", Encoding.GetEncoding(1251)))
        using (StreamWriter writer = new StreamWriter("output.txt", false, Encoding.GetEncoding(1251)))
        {
            while (!reader.EndOfStream)
            {
                writer.WriteLine(reader.ReadLine());
            }
        }

It is a different matter if you want to change the encoding of the file: in that case the reader and the writer need different encodings.
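For the case where the output should use a different encoding than the input, a minimal sketch might look like this. The Transcoder class and ConvertEncoding method are hypothetical names, and the caller is assumed to know the source encoding.

```csharp
using System.IO;
using System.Text;

static class Transcoder
{
    // Reads inputPath as `source` and rewrites it to outputPath as `target`.
    // Note: WriteLine emits Environment.NewLine, so line endings may be normalized.
    public static void ConvertEncoding(string inputPath, string outputPath,
                                       Encoding source, Encoding target)
    {
        using (var reader = new StreamReader(inputPath, source))
        using (var writer = new StreamWriter(outputPath, false, target))
        {
            string line;
            // ReadLine returns null once the end of the file is reached.
            while ((line = reader.ReadLine()) != null)
            {
                writer.WriteLine(line);
            }
        }
    }
}
```

For example, `Transcoder.ConvertEncoding("input.txt", "output.txt", Encoding.GetEncoding("iso-8859-1"), new UTF8Encoding(false))` would rewrite an ISO-8859-1 file as UTF-8 without a byte order mark.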

The following article may give you a good basis of what encodings are: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

And this MSDN article is a good place to start: Encoding Class

Upvotes: 2
