Dave Robinson

Reputation: 55

Can't load UTF-8 formatted text from XML file into Textbox (C#)

I have an XML file that is encoded as UTF-8:

<?xml version="1.0" encoding="UTF-8"?>

One node contains Unicode escape sequences, which I use to preserve French (and other) characters:

<author>Fr\u00e9d\u00e9ric</author>

I want to load this formatted text into a Textbox and show the text as expected, i.e. Frédéric

I am using the following to load the file, and everything else works as expected, just not the conversion.

System.Xml.XmlReader Reader;

Reader = System.Xml.XmlReader.Create(new StreamReader(Filename, Encoding.GetEncoding("UTF-8")));

XMLFile = XDocument.Load(Reader);

The line I use to actually extract the node information is:

var classes = XMLFile.Root.Elements("class").Select(x => x);

This is great, and allows me to extract the information exactly as I need.

It's only the formatting of this French (UTF-8) text that doesn't work as I expected. I did some research, and grabbed two other functions to assist:

private string Decode(string Encoded)
{
    System.Text.UTF8Encoding UTF8 = new System.Text.UTF8Encoding();
    Byte[] Message = UTF8.GetBytes(Encoded);

    return UTF8.GetString(Message);
}

private string Encode(string Original)
{
    System.Text.ASCIIEncoding ASCII = new System.Text.ASCIIEncoding();
    Byte[] Message = ASCII.GetBytes(Original);

    return ASCII.GetString(Message);
}

Neither of these seems to make any difference. All I get in the Textbox is Fr\\u00e9d\\u00e9ric.

What am I missing? Please help.

Upvotes: 0

Views: 1256

Answers (2)

Dave Robinson

Reputation: 55

Ok, so this is what I ended up doing:

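        string Filename = "";   // set this to the path of the XML file before loading
        string Author = "";
        XDocument XMLFile;

        // Dispose both readers once the document has been loaded into memory.
        using (StreamReader Stream = new StreamReader(Filename, Encoding.GetEncoding("UTF-8")))
        using (System.Xml.XmlReader Reader = System.Xml.XmlReader.Create(Stream))
        {
            XMLFile = XDocument.Load(Reader);
        }

        if (XMLFile.Root.Element("author") != null)
            Author = Decode(XMLFile.Root.Element("author").Value);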

And where the magic happens...

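    // Requires: using System.Globalization; using System.Text.RegularExpressions;

    // Matches a literal \uXXXX sequence (backslash, 'u', four hex digits).
    private static readonly Regex DECODING_REGEX = new Regex(@"\\u(?<Value>[a-fA-F0-9]{4})", RegexOptions.Compiled);
    private const string PLACEHOLDER = @"#!#";

    private string Decode(string UnicodeString)
    {
        // Hide doubled backslashes first so that \\uXXXX is not decoded, then restore them.
        // Note: input that already contains the placeholder text would be altered.
        return DECODING_REGEX.Replace(UnicodeString.Replace(@"\\", PLACEHOLDER),
            m => ((char)int.Parse(m.Groups["Value"].Value, NumberStyles.HexNumber)).ToString())
            .Replace(PLACEHOLDER, @"\\");
    }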
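
To make the placeholder step a bit clearer, here is a minimal, self-contained sketch of the same approach. The class name EscapeDemo and the sample strings are just for illustration and are not part of my actual project. It shows that a single-backslash \u00e9 is decoded, while a doubled backslash (\\u00e9) is left alone.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

// Hypothetical demo class; the name is an assumption made for this sketch.
public static class EscapeDemo
{
    // Matches a literal backslash, 'u', and exactly four hex digits.
    private static readonly Regex Escape = new Regex(@"\\u(?<Value>[a-fA-F0-9]{4})");
    private const string Placeholder = "#!#";

    public static string Decode(string text)
    {
        // Protect doubled backslashes, decode the escapes, then restore the backslashes.
        string protectedText = text.Replace(@"\\", Placeholder);
        string decoded = Escape.Replace(protectedText,
            m => ((char)int.Parse(m.Groups["Value"].Value, NumberStyles.HexNumber)).ToString());
        return decoded.Replace(Placeholder, @"\\");
    }

    public static void Main()
    {
        // Single backslash: the escape is converted to the real character.
        Console.WriteLine(Decode(@"Fr\u00e9d\u00e9ric") == "Frédéric"); // True

        // Doubled backslash: treated as an escaped backslash, so nothing is decoded.
        Console.WriteLine(Decode(@"a\\u00e9") == @"a\\u00e9"); // True
    }
}
```

The comparisons are printed instead of the strings themselves so the result does not depend on the console's output encoding.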

Upvotes: 0

Martin

Reputation: 2086

\u00e9 is C# string-literal escape syntax; an XML parser does not interpret it. Use the character reference &#233; in the XML file instead.

However, since you specified UTF-8 for the XML file, no escaping is needed at all as long as your editor really saves the file as UTF-8: you can simply type the characters you want. In Visual Studio, for example, the encoding is set under File / Advanced Save Options.
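
As a rough illustration (the root element name and the class name CharRefDemo are made up for this sketch, not taken from your file), both the character reference and the literal character should come out of XDocument as the same string, with no extra decoding step:

```csharp
using System;
using System.Xml.Linq;

// Hypothetical demo class; "root" is an assumed wrapper element.
public static class CharRefDemo
{
    public static string ReadAuthor(string xml)
    {
        // The XML parser resolves character references such as &#233; while loading.
        return XDocument.Parse(xml).Root.Element("author").Value;
    }

    public static void Main()
    {
        string withReference = "<root><author>Fr&#233;d&#233;ric</author></root>";
        string withLiteral = "<root><author>Frédéric</author></root>";

        // Both forms yield the same text.
        Console.WriteLine(ReadAuthor(withReference) == ReadAuthor(withLiteral)); // True

        // A C#-style escape in the XML text is NOT interpreted by the parser.
        string withCSharpEscape = "<root><author>Fr\\u00e9d\\u00e9ric</author></root>";
        Console.WriteLine(ReadAuthor(withCSharpEscape) == "Frédéric"); // False
    }
}
```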

Upvotes: 2
