Reputation: 55
I have an XML file that is encoded as UTF-8:
<?xml version="1.0" encoding="UTF-8"?>
One node contains Unicode escape sequences, to preserve French (and other) characters:
<author>Fr\u00e9d\u00e9ric</author>
I want to load this text into a TextBox and display it decoded, i.e. Frédéric.
I am using the following to load the file, and everything else works as expected, just not the conversion.
System.Xml.XmlReader Reader;
Reader = System.Xml.XmlReader.Create(new StreamReader(Filename, Encoding.GetEncoding("UTF-8")));
XMLFile = XDocument.Load(Reader);
The line I use to actually extract the node information is:
var classes = XMLFile.Root.Elements("class").Select(x => x);
This is great, and allows me to extract the information exactly as I need.
It's only the formatting of this French (UTF-8) text that doesn't work as I expected. I did some research, and grabbed two other functions to assist:
private string Decode(string Encoded)
{
System.Text.UTF8Encoding UTF8 = new System.Text.UTF8Encoding();
Byte[] Message = UTF8.GetBytes(Encoded);
return UTF8.GetString(Message);
}
private string Encode(string Original)
{
System.Text.ASCIIEncoding ASCII = new System.Text.ASCIIEncoding();
Byte[] Message = ASCII.GetBytes(Original);
return ASCII.GetString(Message);
}
Neither of these seems to make any difference. All I get in the TextBox is Fr\\u00e9d\\u00e9ric.
What am I missing? Please help.
Upvotes: 0
Views: 1256
Reputation: 55
Ok, so this is what I ended up doing:
string Filename = "";
string Author = "";
XDocument XMLFile;
System.Xml.XmlReader Reader;
Reader = System.Xml.XmlReader.Create(new StreamReader(Filename, Encoding.GetEncoding("UTF-8")));
XMLFile = XDocument.Load(Reader);
if (XMLFile.Root.Element("author") != null)
Author = Decode(XMLFile.Root.Element("author").Value);
And this is where the magic happens:
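// Requires: using System.Globalization; using System.Text.RegularExpressions;
private string Decode(string UnicodeString)
{
    // Matches a literal backslash, 'u', and exactly four hex digits, e.g. \u00e9.
    Regex DECODING_REGEX = new Regex(@"\\u(?<Value>[a-fA-F0-9]{4})", RegexOptions.Compiled);
    // Escaped backslashes (\\) are swapped for a placeholder so that \\u00e9 is left alone.
    string PLACEHOLDER = @"#!#";
    return DECODING_REGEX.Replace(UnicodeString.Replace(@"\\", PLACEHOLDER),
        m =>
        {
            // Convert the four hex digits to the corresponding UTF-16 code unit.
            return ((char)int.Parse(m.Groups["Value"].Value, NumberStyles.HexNumber)).ToString();
        })
        .Replace(PLACEHOLDER, @"\\");
}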
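To check the behaviour in isolation, here is a minimal, self-contained sketch of the same approach as a console program. The class name UnicodeEscapeDemo and the sample inputs are made up for illustration; the regex and the placeholder trick are the ones used above.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

public static class UnicodeEscapeDemo
{
    // Literal backslash, 'u', then exactly four hex digits (e.g. \u00e9).
    private static readonly Regex EscapeRegex =
        new Regex(@"\\u(?<Value>[a-fA-F0-9]{4})", RegexOptions.Compiled);

    private const string Placeholder = "#!#";

    public static string Decode(string text)
    {
        // Hide escaped backslashes first so that \\u00e9 is not decoded.
        string hidden = text.Replace(@"\\", Placeholder);
        string decoded = EscapeRegex.Replace(
            hidden,
            m => ((char)int.Parse(m.Groups["Value"].Value, NumberStyles.HexNumber)).ToString());
        // Restore the escaped backslashes unchanged.
        return decoded.Replace(Placeholder, @"\\");
    }

    public static void Main()
    {
        // Single backslash: the escape is decoded.
        Console.WriteLine(Decode(@"Fr\u00e9d\u00e9ric") == "Frédéric");   // True
        // Double backslash: the text is left as it was.
        Console.WriteLine(Decode(@"a\\u00e9b") == @"a\\u00e9b");          // True
    }
}
```

One caveat of this design: if the input already contains the placeholder text #!#, it will come back as a double backslash, so a placeholder that cannot occur in the data is a safer choice.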
Upvotes: 0
Reputation: 2086
\u00e9 is C# syntax, not XML; in the XML file, use the character reference &#xE9; instead.
However, since you declared UTF-8 for the XML file, no escaping is needed at all as long as your editor actually saves the file as UTF-8: you can simply type the characters you want. In Visual Studio, for example, the encoding is set under File / Advanced Save Options.
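A small sketch of the difference, assuming a hypothetical <book> root element and a helper named AuthorOf (neither is from the question). The XML parser resolves a numeric character reference such as &#xE9; and accepts the literal character, but it has no notion of C#-style \u escapes, so those come through as plain text.

```csharp
using System;
using System.Xml.Linq;

public static class CharacterReferenceDemo
{
    // Parses the XML text and returns the value of the <author> element.
    public static string AuthorOf(string xml)
    {
        return XDocument.Parse(xml).Root.Element("author").Value;
    }

    public static void Main()
    {
        // Numeric character reference: resolved by the XML parser.
        string fromReference = AuthorOf("<book><author>Fr&#xE9;d&#xE9;ric</author></book>");
        // Literal characters: nothing to resolve.
        string fromLiteral = AuthorOf("<book><author>Frédéric</author></book>");
        // C#-style escape inside the XML text: stays as backslash, 'u', digits.
        string fromCSharpEscape = AuthorOf(@"<book><author>Fr\u00e9d\u00e9ric</author></book>");

        Console.WriteLine(fromReference == fromLiteral);                  // True
        Console.WriteLine(fromCSharpEscape == @"Fr\u00e9d\u00e9ric");     // True
        Console.WriteLine(fromCSharpEscape == fromLiteral);               // False
    }
}
```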
Upvotes: 2