Reputation: 27
I want to read a text file which includes information about its encoding in its content. I don't know what encoding is used before I read the file. I use System.IO.File.ReadAllText for reading the file. How can I convert the encoding without reading the file again?
I tried specifying the default encoding when reading the file and then converting the result to the final encoding, but it doesn't convert correctly:
string input = File.ReadAllText(filePath, Encoding.Default);
Encoding encoding = GetEncodingFromInput(input);
input = encoding.GetString(Encoding.Convert(Encoding.Default, encoding, Encoding.Default.GetBytes(input)));
The converted string doesn't contain the same characters as when the file is read with the correct encoding. Some characters are changed to question marks.
Upvotes: 0
Views: 1367
Reputation: 131237
From various comments it appears the text is in the IBM Extended 8-bit ASCII codepage, also known as 437. To load files in that codepage use Encoding.GetEncoding(437), e.g.:
var cp437=Encoding.GetEncoding(437);
var input = File.ReadAllText(filePath, cp437);
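One caveat, as a hedged sketch rather than part of the original answer: on .NET Core 3.0+ and .NET 5+ the legacy codepages such as 437 are, as far as I know, only available after registering the code pages provider. The class and method names below (Cp437, Decode) are made up for illustration.

```csharp
using System.Text;

static class Cp437
{
    // Decodes a byte array as codepage 437.
    // Assumption: running on .NET Core 3.0+ / .NET 5+, where
    // CodePagesEncodingProvider ships with the framework.
    public static string Decode(byte[] bytes)
    {
        // Registering the same provider more than once is harmless.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding cp437 = Encoding.GetEncoding(437);
        return cp437.GetString(bytes);
    }
}
```

For a file, something like Cp437.Decode(File.ReadAllBytes(filePath)) would do the same job as the ReadAllText call above.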
The ? or � characters are the conversion error replacement characters returned when trying to read text using the wrong codepage. It's not possible to recover the original text from them.
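To illustrate why the text can't be recovered, here is a minimal sketch. It uses UTF-8 as the "wrong" encoding and a hypothetical helper class (ReplacementDemo) that isn't part of any library. Once a byte has been decoded to the replacement character, encoding the string again yields the bytes of the replacement character, not the original byte.

```csharp
using System;
using System.Text;

static class ReplacementDemo
{
    // Decodes the bytes as UTF-8 (which may be the wrong encoding)
    // and encodes the resulting string back to bytes.
    public static byte[] RoundTripThroughUtf8(byte[] original)
    {
        string decoded = Encoding.UTF8.GetString(original);
        return Encoding.UTF8.GetBytes(decoded);
    }

    // True if decoding as UTF-8 produced at least one U+FFFD replacement character.
    public static bool DecodesWithReplacement(byte[] original)
    {
        return Encoding.UTF8.GetString(original).IndexOf('\uFFFD') >= 0;
    }
}
```

For example, the bytes 63 61 66 E9 ("café" in ISO-8859-1) come back from RoundTripThroughUtf8 as 63 61 66 EF BF BD: the original 0xE9 is gone, and only the encoded form of � remains.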
Encoding.Default is the system's default codepage, not some .NET-wide default. As the docs say:
The Default property in the .NET Framework In the .NET Framework on the Windows desktop, the Default property always gets the system's active code page and creates a Encoding object that corresponds to it. The active code page may be an ANSI code page, which includes the ASCII character set along with additional characters that vary by code page. Because all Default encodings based on ANSI code pages lose data, consider using the Encoding.UTF8 encoding instead. UTF-8 is often identical in the U+00 to U+7F range, but can encode characters outside the ASCII range without loss.
Finally, both File.ReadAllText and the StreamReader class it uses will try to detect the encoding from the file's BOM (Byte Order Mark) and fall back to UTF-8 if no BOM is found.
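A small sketch of that BOM detection, assuming you want to see which encoding StreamReader settled on. BomSniffer and ReadWithBomDetection are names invented for this example; the StreamReader constructor and the CurrentEncoding property are the real API.

```csharp
using System.IO;
using System.Text;

static class BomSniffer
{
    // Reads the whole stream, letting StreamReader pick the encoding from a BOM
    // and falling back to UTF-8 when there is none.
    public static (string Text, Encoding Detected) ReadWithBomDetection(Stream stream)
    {
        using (var reader = new StreamReader(stream, Encoding.UTF8, true))
        {
            string text = reader.ReadToEnd();
            // CurrentEncoding is only meaningful after the first read.
            return (text, reader.CurrentEncoding);
        }
    }
}
```

Note that this only helps for files that actually start with a BOM. A codepage 437 file has none, so it would be read as UTF-8.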
Detecting codepages
There's no reliable way to detect the encoding, as many codepages may use the same bytes. One can only identify bad matches reliably, because the resulting text will contain �.
What one can do is load the file's bytes once and try multiple encodings, eliminating those whose decoded text contains �. Another step would be to check for expected non-English words or characters and eliminate the encodings that don't produce them.
Encoding.GetEncodings() will return all registered encodings. A rough method that finds probable encodings could be :
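IEnumerable<Encoding> DetectEncodings(byte[] buffer)
{
    // Encoding.GetEncodings() returns EncodingInfo objects,
    // so each one has to be resolved to an Encoding first.
    var candidates = from info in Encoding.GetEncodings()
                     let enc = info.GetEncoding()
                     let text = enc.GetString(buffer)
                     where !text.Contains('�')
                     select enc;
    return candidates;
}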
or, using value tuples :
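IEnumerable<(Encoding,string)> DetectEncodings(byte[] buffer)
{
    // Same filter as above, but also returns the decoded text for inspection.
    var candidates = from info in Encoding.GetEncodings()
                     let enc = info.GetEncoding()
                     let text = enc.GetString(buffer)
                     where !text.Contains('�')
                     select (enc, text);
    return candidates;
}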
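The second step mentioned earlier, checking for expected words or characters, could be sketched roughly like this. EncodingFilter and FilterByExpectedText are hypothetical names, and the expected text is something you have to know about your own data.

```csharp
using System.Collections.Generic;
using System.Text;

static class EncodingFilter
{
    // Keeps only the candidate encodings whose decoded text has no
    // replacement character and contains the expected substring.
    public static List<Encoding> FilterByExpectedText(
        byte[] buffer, IEnumerable<Encoding> candidates, string expected)
    {
        var matches = new List<Encoding>();
        foreach (Encoding enc in candidates)
        {
            string text = enc.GetString(buffer);
            bool clean = text.IndexOf('\uFFFD') < 0;
            if (clean && text.Contains(expected))
            {
                matches.Add(enc);
            }
        }
        return matches;
    }
}
```

This still isn't detection in any strict sense. Several codepages can produce the same expected characters, so more than one candidate may survive.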
Upvotes: 1
Reputation: 155065
I don't know what encoding is used before I read the file.
Usually files that self-declare their encoding somehow have a documented technique or method for finding it - check your file format's published documentation.
If not, here are a few common techniques:
… System.IO.StreamReader does by default.
… 0xEF 0xBB 0xBF at the beginning).
… text/*-family of file-formats, with the encoding declared in some kind of header, then you can read the first kilobyte of the file into a buffer and interpret every consecutive byte valued under 0x7F as a character in an ASCII string, then use a simple parser (even String.IndexOf) or a Regex to look for your header's delimiter.
… <meta http-equiv="Content-Type" /> to get the encoding name.
I use System.IO.File.ReadAllText for reading the file. How can I convert encoding without reading the file again?
You don't. Only use ReadAllText for simple text/plain files with a consistent and known encoding - for this scenario you'll need to use Stream and StreamReader (and possibly BinaryReader) together.
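As a rough sketch of the header-sniffing idea, here is one way it might look if the bytes are loaded once and decoded afterwards. Everything specific here is an assumption: the "charset=NAME" header convention, the 1024-byte limit, and the names DeclaredEncodingReader, SniffEncoding and Decode. It also decodes the whole first kilobyte as ASCII (bytes above 0x7F simply become ?) rather than stopping at the first non-ASCII byte.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

static class DeclaredEncodingReader
{
    // Hypothetical header convention: "charset=NAME" somewhere in the first kilobyte.
    static readonly Regex CharsetPattern =
        new Regex(@"charset\s*=\s*[""']?([A-Za-z0-9_\-]+)", RegexOptions.IgnoreCase);

    // Looks for a declared encoding name in the first kilobyte of the data.
    public static Encoding SniffEncoding(byte[] bytes, Encoding fallback)
    {
        int headLength = Math.Min(bytes.Length, 1024);
        string head = Encoding.ASCII.GetString(bytes, 0, headLength);

        Match match = CharsetPattern.Match(head);
        if (!match.Success)
        {
            return fallback;
        }
        try
        {
            return Encoding.GetEncoding(match.Groups[1].Value);
        }
        catch (ArgumentException)
        {
            // Unknown or unregistered encoding name.
            return fallback;
        }
    }

    // Decodes the same byte array with the sniffed encoding, so nothing is read twice.
    public static string Decode(byte[] bytes, Encoding fallback)
    {
        return SniffEncoding(bytes, fallback).GetString(bytes);
    }
}
```

Usage would be along the lines of DeclaredEncodingReader.Decode(File.ReadAllBytes(filePath), Encoding.UTF8), so the file is read from disk only once.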
Upvotes: 4
Reputation: 71
Use System.IO.File.ReadAllBytes to read the file, and then decode the byte array once you know which encoding you need, using something like: System.Text.Encoding.XXXX.GetString()
Upvotes: 1