Governor

Reputation: 27

How to change string encoding after reading it from file using default encoding?

I want to read a text file which includes information about its encoding in its content. I don't know what encoding is used before I read the file. I use System.IO.File.ReadAllText for reading the file. How can I convert the encoding without reading the file again?

I was trying to specify default encoding while reading the file and then converting it to final encoding, but it doesn't convert correctly:

string input = File.ReadAllText(filePath, Encoding.Default);
Encoding encoding = GetEncodingFromInput(input);
input = encoding.GetString(Encoding.Convert(Encoding.Default, encoding, Encoding.Default.GetBytes(input)));

The converted string doesn't contain the same characters as when the file is read with the correct encoding. Some characters are changed to question marks.

Upvotes: 0

Views: 1367

Answers (3)

Panagiotis Kanavos

Reputation: 131237

From various comments it appears the text is in the IBM extended 8-bit ASCII codepage, also known as codepage 437. To load files in that codepage use Encoding.GetEncoding(437), e.g.:

var cp437 = Encoding.GetEncoding(437);
var input = File.ReadAllText(filePath, cp437);
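Note that on .NET Core and .NET 5+ the legacy code pages are not available by default; a minimal sketch, assuming the System.Text.Encoding.CodePages NuGet package is referenced (on .NET Framework this registration step is unnecessary):

```csharp
using System;
using System.IO;
using System.Text;

class Cp437Demo
{
    static void Main()
    {
        // One-time registration; without it, Encoding.GetEncoding(437)
        // throws on .NET Core / .NET 5+.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        var cp437 = Encoding.GetEncoding(437);

        // Demo file: 0x90 and 0x81 are É and ü in code page 437.
        var path = Path.GetTempFileName();
        File.WriteAllBytes(path, new byte[] { 0x90, 0x81 });

        Console.WriteLine(File.ReadAllText(path, cp437)); // prints "Éü"
    }
}
```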

The ? or � characters are the replacement characters emitted when text is decoded with the wrong codepage. It's not possible to recover the original text from them.

Encoding.Default is the system's default codepage, not some .NET-wide default. As the docs say:

The Default property in the .NET Framework: In the .NET Framework on the Windows desktop, the Default property always gets the system's active code page and creates an Encoding object that corresponds to it. The active code page may be an ANSI code page, which includes the ASCII character set along with additional characters that vary by code page. Because all Default encodings based on ANSI code pages lose data, consider using the Encoding.UTF8 encoding instead. UTF-8 is often identical in the U+00 to U+7F range, but can encode characters outside the ASCII range without loss.

Finally, both File.ReadAllText and the StreamReader class it uses will try to detect the encoding from the file's BOM (Byte Order Mark) and fall back to UTF-8 if no BOM is found.

Detecting codepages

There's no reliable way to detect the encoding, as many codepages may use the same bytes. One can only reliably identify bad matches, because the resulting text will contain the replacement character '�'.

What one can do is load the file's bytes once and try multiple encodings, eliminating those whose output contains '�'. A further step would be to check for expected non-English words or characters and eliminate the encodings that don't produce them.

Encoding.GetEncodings() returns an EncodingInfo array describing all registered encodings. A rough method that finds probable encodings could be:

IEnumerable<Encoding> DetectEncodings(byte[] buffer)
{
    var candidates = from info in Encoding.GetEncodings()
                     let enc = info.GetEncoding()
                     let text = enc.GetString(buffer)
                     where !text.Contains('�')
                     select enc;
    return candidates;
}

or, using value tuples:

IEnumerable<(Encoding, string)> DetectEncodings(byte[] buffer)
{
    var candidates = from info in Encoding.GetEncodings()
                     let enc = info.GetEncoding()
                     let text = enc.GetString(buffer)
                     where !text.Contains('�')
                     select (enc, text);
    return candidates;
}
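A usage sketch of the same idea, with a defensive try/catch (the sample bytes and the expected word are made up for illustration):

```csharp
using System;
using System.Collections.Generic;
using System.Text;

class DetectDemo
{
    // Same idea as above: keep encodings whose output has no '�'.
    static IEnumerable<(Encoding, string)> DetectEncodings(byte[] buffer)
    {
        foreach (var info in Encoding.GetEncodings())
        {
            var enc = info.GetEncoding();
            string text;
            try { text = enc.GetString(buffer); }
            catch { continue; } // e.g. UTF-7 is disabled on .NET 5+
            if (!text.Contains('\uFFFD'))
                yield return (enc, text);
        }
    }

    static void Main()
    {
        // Make legacy code pages visible to GetEncodings() on .NET Core
        // (requires the System.Text.Encoding.CodePages package).
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        // CP437 bytes for "Héllo" -- 0x82 is é in code page 437.
        var buffer = new byte[] { 0x48, 0x82, 0x6C, 0x6C, 0x6F };

        foreach (var (enc, text) in DetectEncodings(buffer))
        {
            // Second filter: only encodings producing the expected word.
            if (text == "Héllo")
                Console.WriteLine(enc.CodePage);
        }
    }
}
```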

Upvotes: 1

Dai

Reputation: 155065

I don't know what encoding is used before I read the file.

Usually files that self-declare their encoding somehow have a documented technique or method for finding it - check your file format's published documentation.

If not, here's a few common techniques:

  1. Look for a Unicode BOM in the first few bytes. You can do this by reading the first few (up to 5) bytes from the file into a buffer (or a 64-bit integer) and looking them up in a dictionary of known BOMs. This is what System.IO.StreamReader does by default.
    • You can see a list of known BOM byte sequences here: https://en.wikipedia.org/wiki/Byte_order_mark
    • Note that UTF-8 does not require a BOM - but many editors (well, just Visual Studio) will stick 0xEF 0xBB 0xBF at the beginning.
  2. If it's a text/*-family file format with the encoding declared in some kind of header, then you can read the first kilobyte of the file into a buffer, interpret every byte valued under 0x80 as a character in an ASCII string, and use a simple parser (even String.IndexOf) or a Regex to look for your header's delimiter.
    • This technique is often used for HTML files where the HTTP header declaring the encoding isn't available and the program needs to look for <meta http-equiv="Content-Type" /> to get the encoding name.
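A rough sketch of that scan (the regex and the sample header are illustrative choices, not a full HTML parser):

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

class CharsetSniffer
{
    // Decode a prefix of the file as ASCII and look for a charset
    // declaration; bytes >= 0x80 turn into '?', which is harmless here
    // because only the ASCII-valued header text matters.
    static string SniffCharset(byte[] prefix)
    {
        var header = Encoding.ASCII.GetString(prefix);
        var m = Regex.Match(header,
            @"charset\s*=\s*[""']?([A-Za-z0-9_\-]+)",
            RegexOptions.IgnoreCase);
        return m.Success ? m.Groups[1].Value : null;
    }

    static void Main()
    {
        // Stand-in for the first kilobyte of an HTML file.
        var html = "<meta http-equiv=\"Content-Type\" " +
                   "content=\"text/html; charset=windows-1252\" />";
        Console.WriteLine(SniffCharset(Encoding.ASCII.GetBytes(html)));
        // prints "windows-1252"
    }
}
```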

I use System.IO.File.ReadAllText for reading the file. How can I convert encoding without reading the file again?

You don't. Only use ReadAllText for simple text/plain files with a consistent and known encoding - for this scenario you'll need to use Stream and StreamReader (and possibly BinaryReader) together.
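One way to avoid hitting the disk twice is to read the bytes once and decode them in two passes. A sketch (the `encoding=` marker is a hypothetical declaration rule, not part of any real format):

```csharp
using System;
using System.Text;

class TwoPassDecode
{
    static void Main()
    {
        // Stand-in for File.ReadAllBytes(path): the raw bytes, read once.
        byte[] buffer = Encoding.UTF8.GetBytes("encoding=utf-8\nGrüße");

        // First pass: decode just enough ASCII to find the declaration.
        string probe = Encoding.ASCII.GetString(buffer);

        // Hypothetical lookup rule -- substitute your format's real one.
        Encoding actual = probe.Contains("encoding=utf-8")
            ? Encoding.UTF8
            : Encoding.Default;

        // Second pass: decode the same bytes with the declared encoding.
        Console.WriteLine(actual.GetString(buffer)); // ends with "Grüße"
    }
}
```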

Upvotes: 4

Christoph

Reputation: 71

Use System.IO.File.ReadAllBytes to read the file, then decode the byte array once you know which encoding you need, using something like System.Text.Encoding.XXXX.GetString().
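For example (a minimal sketch; the temp file and byte values are stand-ins for your real file):

```csharp
using System;
using System.IO;
using System.Text;

class DecodeOnce
{
    static void Main()
    {
        var path = Path.GetTempFileName();
        File.WriteAllBytes(path, new byte[] { 0xC3, 0xA9 }); // UTF-8 "é"

        byte[] bytes = File.ReadAllBytes(path); // read once, keep the bytes

        // Decode the same bytes with whichever encoding turns out right.
        Console.WriteLine(Encoding.UTF8.GetString(bytes));                      // "é"
        Console.WriteLine(Encoding.GetEncoding("iso-8859-1").GetString(bytes)); // "Ã©"
    }
}
```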

Upvotes: 1
