Reputation: 171784
I'm trying to unescape XML entities in a string in .NET (C#), but I don't seem to get it to work correctly.
For example, if I have the string AT&amp;T, it should be translated to AT&T.
One way is to use HttpUtility.HtmlDecode(), but that's for HTML.
So I have two questions about this:
Is it safe to use HttpUtility.HtmlDecode() for decoding XML entities?
How do I use XmlReader (or something similar) to do this? I have tried the following, but that always returns an empty string:
static string ReplaceEscapes(string text)
{
    StringReader reader = new StringReader(text);
    XmlReaderSettings settings = new XmlReaderSettings();
    settings.ConformanceLevel = ConformanceLevel.Fragment;
    using (XmlReader xmlReader = XmlReader.Create(reader, settings))
    {
        return xmlReader.ReadString();
    }
}
Upvotes: 11
Views: 14010
Reputation: 563
This also works, and needs the least code:
public static string DecodeString(string encodedString)
{
    // Workaround: Read() returns false for empty input, so handle that case up front.
    if (string.IsNullOrEmpty(encodedString))
        return string.Empty;
    XmlTextReader xtr = new XmlTextReader(encodedString, XmlNodeType.Element, null);
    if (xtr.Read())
        return xtr.ReadString();
    throw new Exception("Error decoding xml string : " + encodedString);
}
Update 1: Hmm, it seems this does not work if encodedString is "", because xtr.Read() then returns false.
Update 2: Added a workaround.
Update 3: This seems to work even better:
public static string DecodeString(string encodedString)
{
    XmlTextReader xtr = new XmlTextReader(encodedString, XmlNodeType.Element, null);
    xtr.MoveToContent();
    return xtr.Value;
}
Upvotes: 1
Reputation: 96
I found that the top answer has a small bug if your input text ends with certain white space characters, like carriage returns.
The string "Testing " loses its trailing white space.
If you combine the solution in the question with adrianbanks' wrapper tag you get the following, which works.
public static string UnescapeUnicode(string line)
{
    using (StringReader reader = new StringReader("<a>" + line + "</a>"))
    {
        using (XmlReader xmlReader = XmlReader.Create(reader))
        {
            xmlReader.MoveToContent();
            return xmlReader.ReadElementContentAsString();
        }
    }
}
Upvotes: 1
Reputation: 2385
This works:
using (XmlReader xmlReader = XmlReader.Create(reader, settings))
{
    if (xmlReader.Read())
    {
        return xmlReader.ReadString();
    }
}
Upvotes: 1
Reputation: 82944
HTML escaping and XML escaping are closely related. As you have said, HttpUtility has both HtmlEncode and HtmlDecode methods. These will also operate on XML, as there are only a few characters that need escaping in both HTML and XML: <, >, ", ' and &.
The downside of using the HttpUtility class is that you need a reference to the System.Web dll, which also brings in a lot of other stuff that you probably don't want.
Specifically for XML, the SecurityElement class has an Escape method that will do the encoding, but it does not have a corresponding Unescape method. You therefore have a few options:
use HttpUtility.HtmlDecode() and put up with a reference to System.Web
roll your own decode method that takes care of the special characters (there are only a handful; look at the static constructor of SecurityElement in Reflector to see the full list)
use a (hacky) solution like:
public static string Unescape(string text)
{
    XmlDocument doc = new XmlDocument();
    string xml = string.Format("<dummy>{0}</dummy>", text);
    doc.LoadXml(xml);
    return doc.DocumentElement.InnerText;
}
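The "roll your own" option could look something like the sketch below. The class and method names (XmlEntityDecoder, Decode, DecodeEntity) are made up for illustration, and it is only a rough outline: it handles the five predefined XML entities plus numeric character references, leaves anything it does not recognise untouched instead of throwing, and does not check that a numeric reference is a character XML actually allows.

```csharp
using System;
using System.Globalization;
using System.Text;

// Hypothetical helper, not part of the .NET framework.
public static class XmlEntityDecoder
{
    // Decodes &lt; &gt; &amp; &quot; &apos; and numeric references (&#38; / &#x26;)
    // in a single pass, so "&amp;lt;" becomes "&lt;" and is not decoded twice.
    public static string Decode(string text)
    {
        if (string.IsNullOrEmpty(text))
            return string.Empty;

        StringBuilder result = new StringBuilder(text.Length);
        int i = 0;
        while (i < text.Length)
        {
            // Look for a terminating ';' only when we are at an '&'.
            int end = text[i] == '&' ? text.IndexOf(';', i + 1) : -1;
            string decoded = end > i
                ? DecodeEntity(text.Substring(i + 1, end - i - 1))
                : null;

            if (decoded == null)
            {
                // Not a recognised entity: copy the character as-is.
                result.Append(text[i]);
                i++;
            }
            else
            {
                result.Append(decoded);
                i = end + 1;
            }
        }
        return result.ToString();
    }

    // Returns the replacement text, or null if the name is not recognised.
    private static string DecodeEntity(string name)
    {
        switch (name)
        {
            case "lt": return "<";
            case "gt": return ">";
            case "amp": return "&";
            case "quot": return "\"";
            case "apos": return "'";
        }

        if (name.Length > 1 && name[0] == '#')
        {
            bool hex = name[1] == 'x';
            string digits = name.Substring(hex ? 2 : 1);
            NumberStyles style = hex ? NumberStyles.AllowHexSpecifier : NumberStyles.None;
            int code;
            // Reject values outside the Unicode range and lone surrogates.
            if (int.TryParse(digits, style, CultureInfo.InvariantCulture, out code)
                && code > 0 && code <= 0x10FFFF
                && (code < 0xD800 || code > 0xDFFF))
            {
                return char.ConvertFromUtf32(code);
            }
        }
        return null;
    }
}
```

With this sketch, XmlEntityDecoder.Decode("AT&amp;T") gives "AT&T", and a bare ampersand such as the one in "AT&T" is passed through unchanged. Whether unknown entities should be passed through or rejected is a design choice; a strict XML parser would reject them.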
Personally, I would use HttpUtility.HtmlDecode() if I already had a reference to System.Web, or roll my own if not. I don't like your XmlReader approach as it is Disposable, which usually indicates that it is using resources that need to be disposed, and so may be a costly operation.
Upvotes: 17
Reputation: 8071
Your #2 solution can work, but you need to call xmlReader.Read(); (or xmlReader.MoveToContent();) prior to ReadString.
I guess #1 would also be acceptable, although there are edge cases like &reg;, which is a valid HTML entity but not an XML entity. What should your unescaper do with it? Throw an exception, as a proper XML parser would, or just return "®", as an HTML parser would?
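To make the &reg; edge case concrete, here is a small sketch that puts the two behaviours side by side. The class and method names are invented for the example. It uses System.Net.WebUtility.HtmlDecode in place of HttpUtility.HtmlDecode so that no System.Web reference is needed (an assumption about what is convenient, not something from the question), and the XML side is the question's XmlReader code with the missing Read() call added.

```csharp
using System;
using System.IO;
using System.Net;
using System.Xml;

// Hypothetical demo class, for illustration only.
public static class EntityEdgeCase
{
    // HTML-style decoding: knows the named HTML entities such as &reg;.
    public static string DecodeAsHtml(string text)
    {
        return WebUtility.HtmlDecode(text);
    }

    // XML-style decoding: only the predefined XML entities and numeric
    // references are known, so &reg; is an undeclared entity here.
    public static string DecodeAsXml(string text)
    {
        XmlReaderSettings settings = new XmlReaderSettings();
        settings.ConformanceLevel = ConformanceLevel.Fragment;

        using (StringReader stringReader = new StringReader(text))
        using (XmlReader xmlReader = XmlReader.Create(stringReader, settings))
        {
            // Read() must be called first; before that the reader is not
            // positioned on any node and ReadString() returns "".
            return xmlReader.Read() ? xmlReader.ReadString() : string.Empty;
        }
    }
}
```

Both methods turn "AT&amp;T" into "AT&T". For "&reg;", DecodeAsHtml returns "®", while DecodeAsXml is expected to throw an XmlException because the entity is not declared, so the caller has to decide which of the two behaviours it wants.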
Upvotes: 8