Reputation: 1200
I'm opening a text file and removing the first line to prepare it for import into a database using bulk insert. Here is my code:
string tempFile = Path.GetTempFileName();
using (var sr = new StreamReader("F:\\Upload\\File.txt", System.Text.Encoding.UTF8))
{
    using (var sw = new StreamWriter(tempFile, true, System.Text.Encoding.UTF8))
    {
        string line;
        while ((line = sr.ReadLine()) != null)
        {
            if (line.Substring(0, 8) != "Nr. Crt.")
                sw.WriteLine(line);
        }
    }
}
System.IO.File.Delete("F:\\Upload\\File.txt");
System.IO.File.Move(tempFile, "F:\\Upload\\File.txt");
After this, if I open the resulting file, Unicode characters are replaced with other characters. For example, strings containing a non-breaking space (Unicode U+00A0), such as Value followed by that character, are transformed into Value�.
How can I avoid this?
Edit:
Notepad++ is set to 'Encode in UTF-8'. Here is a picture of how it looks:
Upvotes: 3
Views: 4454
Reputation: 942438
are transformed in Value�
The byte values for those 3 odd characters are 0xef 0xbf 0xbd, which is the UTF-8 encoding of code point U+FFFD, the replacement character �. A decoder substitutes it when it reads UTF-encoded text and the text contains an invalid byte sequence.
That points squarely at an issue with File.txt: it was probably not encoded in UTF-8. If you have no idea what encoding was used for that file, then the first guess is to pass Encoding.Default (on .NET Framework, the system's ANSI code page) to the StreamReader constructor.
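To make the mechanism concrete, here is a minimal sketch of what probably happens to the bytes. It assumes the file was written in a single-byte encoding in which 0xA0 is the non-breaking space; ISO-8859-1 is used purely as a stand-in, since the question never says what the real encoding was.

```csharp
using System;
using System.Text;

// "Value" followed by one 0xA0 byte: a non-breaking space in ISO-8859-1
// (assumed encoding), but not a valid UTF-8 sequence on its own.
byte[] raw = { 0x56, 0x61, 0x6C, 0x75, 0x65, 0xA0 };

// Decoding as UTF-8 silently substitutes U+FFFD for the invalid byte.
string wrong = Encoding.UTF8.GetString(raw);

// Writing that string back as UTF-8 stores the replacement character's bytes.
byte[] rewritten = Encoding.UTF8.GetBytes(wrong);
Console.WriteLine(BitConverter.ToString(rewritten)); // 56-61-6C-75-65-EF-BF-BD

// Decoding with the encoding the bytes were actually written in keeps U+00A0.
string right = Encoding.GetEncoding("iso-8859-1").GetString(raw);
Console.WriteLine(((int)right[5]).ToString("X4")); // 00A0
```

If this matches your situation, the only change needed in the question's code would be the encoding argument of the StreamReader; the StreamWriter can keep writing UTF-8.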
Upvotes: 7
Reputation: 1064204
It looks to me like it is writing fine, but the tool you are reading with is not expecting UTF-8. In many cases, you need to explicitly tell the tool what encoding to expect. However, a common approach is to prepend a BOM ("byte order mark"). This is simple: just use new UTF8Encoding(true) as the encoding and it will happen automatically. In tools that don't expect a BOM, this will display as a few mangled characters at the start, but most modern tools will know what it means and will switch to UTF-8 automatically. The point is that the BOMs for UTF-8, UTF-16 LE, UTF-16 BE and so on are all slightly different but recognisable. A more complete list is on Wikipedia.
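As a rough illustration of the BOM suggestion, the sketch below writes one line with new UTF8Encoding(true) and then inspects the first three bytes of the file. The temporary file and the sample text are placeholders for the demonstration, not part of the original code.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical scratch file, used only for this demonstration.
string path = Path.GetTempFileName();

// true = emit the UTF-8 byte order mark when writing from the start of the file.
var utf8WithBom = new UTF8Encoding(true);

using (var sw = new StreamWriter(path, false, utf8WithBom))
{
    sw.WriteLine("Value\u00A0");
}

// The file now starts with the UTF-8 BOM, followed by the encoded text.
byte[] bytes = File.ReadAllBytes(path);
Console.WriteLine(BitConverter.ToString(bytes, 0, 3)); // EF-BB-BF

File.Delete(path);
```

A tool that recognises the BOM can use those three bytes to pick UTF-8 without being told explicitly.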
Upvotes: 4