Christopher

Reputation: 10627

How do I preserve special characters when writing XML with XDocument.Save()?

My source XML has the copyright character in it as &#x00A9;. When writing the XML with this code:

var stringWriter = new StringWriter();
segmentDoc.Save(stringWriter);
Console.WriteLine(stringWriter.ToString());

it is rendering that copyright character as a little "c" with a circle around it. I'd like to preserve the original code so it gets spit back out as &#x00A9;. How can I do this?

Update: I also noticed that the source declaration looks like <?xml version="1.0" encoding="utf-8"?> but my saved output looks like <?xml version="1.0" encoding="utf-16"?>. Can I indicate that I want the output to still be utf-8? Would that fix it?

Update2: Also, &#x00A0; is getting output as ÿ. I definitely don't want that happening!

Update3: &#x00A7; is becoming a little box and that is wrong, too. It should be &#x00A7;.

Upvotes: 2

Views: 4490

Answers (4)

Ben Valentine

Reputation: 11

I had the same problem when saving some Lithuanian characters this way. I found a way to cheat around it by replacing & with &amp; in the source (for example, writing &amp;#x00A9; to get &#x00A9; in the output, and so on). It looks strange, but it worked for me :)

Upvotes: 1

kbrimington

Reputation: 25652

It appears that UTF8 won't solve the problem. The following has the same symptoms as your code:

MemoryStream ms = new MemoryStream();
XmlTextWriter writer = new XmlTextWriter(ms, new UTF8Encoding());
segmentDoc.Save(writer);
writer.Flush(); // push any buffered output into the stream before reading it back
ms.Seek(0L, SeekOrigin.Begin);
var reader = new StreamReader(ms);
var result = reader.ReadToEnd();
Console.WriteLine(result);

I tried the same approach with ASCII, but wound up with ? instead of ©.

I think using a string replace after converting the XML to a string is your best bet to get the effect you want. Of course, that could be cumbersome if you are interested in more than just the &copy; symbol.

result = result.Replace("©", "\u0026#x00A9;");
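If more than one character needs this treatment, the single Replace call could be generalized. The following is only a minimal sketch: the EscapeNonAscii helper and the class name are made up for illustration, and it assumes it is acceptable to rewrite every non-ASCII character in the serialized string as a zero-padded hexadecimal character reference.

```csharp
using System;
using System.Text;

public static class XmlEscapeSketch
{
    // Hypothetical helper: rewrites every character above U+007F as &#xHHHH;.
    public static string EscapeNonAscii(string xml)
    {
        var sb = new StringBuilder(xml.Length);
        for (int i = 0; i < xml.Length; i++)
        {
            char c = xml[i];
            if (c <= 0x7F)
            {
                sb.Append(c); // plain ASCII passes through untouched
                continue;
            }

            int codePoint = c;
            // Combine a surrogate pair into a single code point.
            if (char.IsHighSurrogate(c) && i + 1 < xml.Length && char.IsLowSurrogate(xml[i + 1]))
            {
                codePoint = char.ConvertToUtf32(c, xml[i + 1]);
                i++;
            }

            sb.Append("&#x").Append(codePoint.ToString("X4")).Append(';');
        }
        return sb.ToString();
    }

    public static void Main()
    {
        // prints <p>&#x00A9; 2010, &#x00A7;1</p>
        Console.WriteLine(EscapeNonAscii("<p>\u00A9 2010, \u00A71</p>"));
    }
}
```

One caveat with this kind of post-processing: it is blind to XML structure, so it would also rewrite characters inside comments or CDATA sections, where character references are not interpreted.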

Upvotes: 0

Jon Skeet

Reputation: 1500525

I strongly suspect you won't be able to do this. Fundamentally, the copyright sign is &#x00A9; - they're different representations of the same thing, and I expect that the in-memory representation normalizes this.

What are you doing with the XML afterwards? Any sane application processing the resulting XML should be fine with it.

You may be able to persuade it to use the entity reference if you explicitly encode it with ASCII... but I'm not sure.

EDIT: You can definitely make it use a different encoding. You just need a StringWriter which reports that its "native" encoding is UTF-8. Here's a simple class you can use for that:

public class Utf8StringWriter : StringWriter
{
    public override Encoding Encoding
    {
         get { return Encoding.UTF8; }
    }
}
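For illustration, here is one way that class might be plugged into the code from the question. This is a sketch rather than a definitive recipe: SaveToString and the sample document are invented for the example, and the class is repeated only so the snippet compiles on its own. The declaration should come out as utf-8, but the copyright sign itself is still written as a literal character.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml.Linq;

// Repeated from above so this snippet is self-contained.
public class Utf8StringWriter : StringWriter
{
    public override Encoding Encoding
    {
        get { return Encoding.UTF8; }
    }
}

public static class Utf8SaveDemo
{
    // Hypothetical helper: serializes the document through the UTF-8-reporting writer.
    public static string SaveToString(XDocument doc)
    {
        using (var writer = new Utf8StringWriter())
        {
            doc.Save(writer);
            return writer.ToString();
        }
    }

    public static void Main()
    {
        // Stand-in for the question's segmentDoc.
        var segmentDoc = XDocument.Parse("<notice>&#x00A9; 2010</notice>");
        Console.WriteLine(SaveToString(segmentDoc));
    }
}
```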

You could try changing it to use Encoding.ASCII as well and see what that does to the copyright sign...
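A sketch of the ASCII idea, taking a slightly different route from subclassing StringWriter: it writes through an XmlWriter created over a stream with XmlWriterSettings.Encoding set to ASCII. As far as I know, a writer created this way falls back to numeric character references for anything the target encoding cannot represent, so the copyright sign should come out as a reference (in the short form &#xA9; rather than the zero-padded &#x00A9;). SaveAsAscii and the sample document are made up for the example.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;
using System.Xml.Linq;

public static class AsciiSaveDemo
{
    // Hypothetical helper: serializes the document as ASCII bytes and decodes them to a string.
    public static string SaveAsAscii(XDocument doc)
    {
        var settings = new XmlWriterSettings { Encoding = Encoding.ASCII };
        using (var ms = new MemoryStream())
        {
            // Disposing the XmlWriter flushes everything into the stream.
            using (var writer = XmlWriter.Create(ms, settings))
            {
                doc.Save(writer);
            }
            return Encoding.ASCII.GetString(ms.ToArray());
        }
    }

    public static void Main()
    {
        // Stand-in for the question's segmentDoc.
        var segmentDoc = XDocument.Parse("<notice>&#x00A9; 2010 &#x00A7;1</notice>");
        Console.WriteLine(SaveAsAscii(segmentDoc));
    }
}
```

Note that the XML declaration will then name the ASCII encoding instead of utf-8, which may or may not matter to whatever consumes the output.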

Upvotes: 4

Ivo

Reputation: 3436

Maybe you can try a different document encoding. Check out: http://www.sagehill.net/docbookxsl/CharEncoding.html

Upvotes: 0
