lumaluis

Reputation: 113

Encode text in c# utf-8 without BOM

I tried, but it didn't work. I want to encode without a BOM, but even with the option set to false it still encodes in UTF-8 with a BOM.

Here is my code

System.Text.Encoding outputEnc = new System.Text.UTF8Encoding(false);
                return File(outputEnc.GetBytes(" <?xml version=\"1.0\" encoding=\"utf-8\"?>" + xmlString), "application/xml", id);

Upvotes: 4

Views: 7619

Answers (1)

rmunn

Reputation: 36688

This question is more than two years old, but I've found the answer. The reason you were seeing a BOM in the output is that there's a BOM in your input. What appears to be a space at the start of your XML declaration is actually a BOM followed by a space. To prove it, select the text " < from your XML declaration (the opening double-quote, the space following it, and the opening < character) and paste it into any tool that tells you Unicode codepoints. For example, pasting that text into http://www.babelstone.co.uk/Unicode/whatisit.html gave me the following result:

U+0022 : QUOTATION MARK
U+FEFF : ZERO WIDTH NO-BREAK SPACE [ZWNBSP] (alias BYTE ORDER MARK [BOM])
U+0020 : SPACE [SP]
U+003C : LESS-THAN SIGN

You can also copy and paste from the " < that I put in this answer: I copied those characters from your question, so they contain the invisible BOM immediately before the space character.

This is why I often refer to the BOM as a BOM(b) -- because it sits there silently, hidden, waiting to blow up on you when you least expect it. You were using System.Text.UTF8Encoding(false) correctly. It didn't add a BOM, but the source that you copied and pasted your XML from contained a BOM, so you got one in your output anyway because you had one in your input.
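A minimal defensive sketch (not part of the original code): strip a leading U+FEFF from the input string before encoding, so a BOM pasted in with the source text can't leak into the output bytes. The sample XML string here is hypothetical.

    // Remove a leading BOM, if any, before encoding.
    string xml = "\uFEFF <?xml version=\"1.0\" encoding=\"utf-8\"?><root/>";
    xml = xml.TrimStart('\uFEFF');

    // false = do not emit a UTF-8 identifier (BOM) when encoding.
    var outputEnc = new System.Text.UTF8Encoding(false);
    byte[] bytes = outputEnc.GetBytes(xml);
    // bytes now starts with 0x20 0x3C (" <"), not 0xEF 0xBB 0xBF.

With this in place, `UTF8Encoding(false)` behaves as expected, because the only way a BOM could appear in the output was as literal data inside the input string.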

Personal rant: It's a very good idea to leave BOMs out of your UTF-8 encoded text. However, some broken tools (Microsoft, I'm looking at you since you're the ones who made most of them) will misinterpret text if it doesn't contain a BOM, so adding a BOM to UTF-8 encoded text is sometimes necessary. But it should really be avoided as much as possible. UTF-8 is now the de facto default encoding for the Internet, so any text file whose encoding is unknown should be parsed as UTF-8 first, falling back to "legacy" encodings like Windows-1252, Latin-1, etc. only if parsing the document as UTF-8 fails.
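The "UTF-8 first, legacy fallback" strategy above can be sketched in C# like this (a hedged illustration, not a full charset detector; the choice of Latin-1 as the fallback is an assumption for the example):

    using System.Text;

    static string DecodeText(byte[] bytes)
    {
        // Strict UTF-8: no BOM on encode, throw on invalid byte sequences.
        var strictUtf8 = new UTF8Encoding(
            encoderShouldEmitUTF8Identifier: false,
            throwOnInvalidBytes: true);
        try
        {
            return strictUtf8.GetString(bytes);
        }
        catch (DecoderFallbackException)
        {
            // Not valid UTF-8: fall back to a legacy single-byte encoding.
            return Encoding.GetEncoding("ISO-8859-1").GetString(bytes);
        }
    }

Because every byte sequence is valid Latin-1, the fallback always succeeds; the strict UTF-8 pass simply gets first claim on any input that decodes cleanly.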

Upvotes: 3
