Reputation: 365
I was given a String variable with the following content:
<main>
<Title title="Hello World" />
<Content content="bla bla bla... by <1% to ??? on other bla bla...." />
</main>
This string will eventually be passed to a stored procedure for XQuery.
As you can see, the content of "Content" contains the character "<", which causes an error when I try to parse the string in the stored procedure.
My question is how to convert that "<" into "&lt;" (in this case "<1%" into "&lt;1%") in an efficient way.
I want to keep the other "<" characters as they are.
Thanks
Upvotes: 0
Views: 114
Reputation: 155503
Since you updated your question to point out that you are dealing with XML, but the unencoded values are in attribute values rather than #text nodes, the problem becomes somewhat simpler: extract the attribute value using an approach similar to my previous answer, use a library function to entitize it, then write it back into the document.
Note that CDATA only applies to #text nodes, not attributes.
String doc =
@"<main>
<Title title=""Hello World"" />
<Content content=""bla bla bla... by <1% to ??? on other bla bla...."" />
</main>";
Int32 contentOpenStart = doc.IndexOf("<Content");
Int32 contentAttribContentValueStart = doc.IndexOf("content=\"", contentOpenStart) + "content=\"".Length;
Int32 contentAttribContentValueEnd = doc.IndexOf("\"", contentAttribContentValueStart);
// Substring's second argument is a length, not an end index.
String attributeValueOld = doc.Substring( contentAttribContentValueStart, contentAttribContentValueEnd - contentAttribContentValueStart );
String attributeValueNew = System.Net.WebUtility.HtmlEncode( attributeValueOld );
String doc2 = String.Concat(
doc.Substring( 0, contentAttribContentValueStart ), // everything up to and including content="
attributeValueNew,
doc.Substring( contentAttribContentValueEnd ) // closing quote and the rest of the document
);
doc2 then contains the fixed attribute value.
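If more than one attribute might hold a raw "<", the same idea can be applied to every attribute value at once. The following is only a sketch: it assumes every attribute value is wrapped in double quotes and contains no unescaped double quote, and the names rawDoc and fixedDoc are made up for illustration. It escapes only the angle brackets, so entities that are already present are not double-encoded, and the "<" characters that open tags are left as they are.

```csharp
using System;
using System.Text.RegularExpressions;

// Assumed input: the document from the question, with a raw "<" inside an attribute value.
String rawDoc =
@"<main>
<Title title=""Hello World"" />
<Content content=""bla bla bla... by <1% to ??? on other bla bla...."" />
</main>";

// Match every ="..." pair and escape only the angle brackets inside the quotes.
// "&" is deliberately not touched, so existing entities such as &amp; survive.
String fixedDoc = Regex.Replace(
    rawDoc,
    "=\"([^\"]*)\"",
    m => "=\"" + m.Groups[1].Value.Replace( "<", "&lt;" ).Replace( ">", "&gt;" ) + "\"" );

Console.WriteLine( fixedDoc );
```

Text nodes that happen to contain a ="..." sequence would also be matched, so this is only suitable when the input looks like the sample above.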
Note that using HtmlEncode to perform HTML-encoding of entities is not strictly correct in XML, as the set of XML entities is much smaller than HTML's. Indeed, XML predefines entities only for &, >, <, " and '; all other characters should appear in the document as raw/native characters.
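As a possible alternative to HtmlEncode, System.Security.SecurityElement.Escape replaces only those five characters. A minimal sketch, where rawValue is an assumed sample standing in for the extracted attribute value:

```csharp
using System;
using System.Security;

// Assumed sample value: the raw text taken from the content="" attribute.
String rawValue = "bla bla bla... by <1% to ??? on other bla bla....";

// SecurityElement.Escape replaces only <, >, ", ' and & with their XML entities.
String xmlEscaped = SecurityElement.Escape( rawValue );

Console.WriteLine( xmlEscaped );
// bla bla bla... by &lt;1% to ??? on other bla bla....
```

Like HtmlEncode, this also escapes "&", so a value that already contains entities would be double-encoded.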
Upvotes: 1
Reputation: 155503
(This answer assumes you're dealing with structurally correct XML that merely has unencoded entities in #text nodes. It does not apply if your input data really does look like <Title="foo" />, which isn't XML at all.)
If I understand your problem correctly, you have an XML document in a String instance which contains improperly escaped/entitized special characters, which prevents you from using a normal XML parser to read the document.
If you're dealing with an XML-compliant system, then you can wrap the text in a CDATA section (<![CDATA[ ... ]]>) and avoid having to process the content of the <Content> element at all. The trick then becomes inserting the CDATA delimiters.
While it's often said one cannot use a regular-expression to parse XML (as XML is not a Regular Language), you can take advantage of the grammatical rules of XML to extract and identify tags.
So if you have this:
<Content someAttribute="someValue">
reduce sales by <1% in order to ensure that profit > loss
</Content>
Then you can do this:
String doc = @"<main><Title...";
Int32 contentOpenStart = doc.IndexOf("<Content");
Int32 contentOpenEnd = doc.IndexOf(">", contentOpenStart);
Int32 contentCloseStart = doc.IndexOf("</Content>", contentOpenEnd);
This code then tells us the locations of the angle brackets of the <Content> element's two tags, with which we can insert the CDATA delimiters:
String newDocument = String.Concat(
doc.Substring( 0, contentOpenEnd + 1 ), // "<main>...<Content...>"
"<![CDATA[",
// Substring's second argument is a length, not an end index.
doc.Substring( contentOpenEnd + 1, contentCloseStart - ( contentOpenEnd + 1 ) ),
"]]>",
doc.Substring( contentCloseStart ) // "</Content>..."
);
newDocument will then be this:
<Content someAttribute="someValue"><![CDATA[
reduce sales by <1% in order to ensure that profit > loss
]]></Content>
...which is valid XML.
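The steps above could be folded into a small helper and checked by handing the result to a real parser. This is a sketch only: WrapInCData is a hypothetical name, and it assumes the element occurs once, is not self-closing, has no ">" inside its attribute values, and that its text does not contain "]]>".

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper: wraps the body of the first <elementName> element in a CDATA section.
String WrapInCData( String doc, String elementName )
{
    Int32 openStart = doc.IndexOf( "<" + elementName, StringComparison.Ordinal );
    if( openStart < 0 ) return doc; // element not present, nothing to do

    Int32 openEnd = doc.IndexOf( ">", openStart, StringComparison.Ordinal );
    if( openEnd < 0 ) return doc; // opening tag never closes

    Int32 closeStart = doc.IndexOf( "</" + elementName + ">", openEnd, StringComparison.Ordinal );
    if( closeStart < 0 ) return doc; // no closing tag found

    Int32 bodyStart = openEnd + 1;
    return String.Concat(
        doc.Substring( 0, bodyStart ),                       // up to and including the opening tag
        "<![CDATA[",
        doc.Substring( bodyStart, closeStart - bodyStart ),  // raw element body
        "]]>",
        doc.Substring( closeStart ) );                       // closing tag and the rest
}

// Assumed sample input with unescaped "<" and ">" in the element text.
String brokenXml = "<main><Content someAttribute=\"someValue\">reduce sales by <1% in order to ensure that profit > loss</Content></main>";
String wrappedXml = WrapInCData( brokenXml, "Content" );

// XDocument.Parse throws an XmlException if the result is still not well-formed.
XDocument parsed = XDocument.Parse( wrappedXml );
String contentText = parsed.Descendants( "Content" ).First().Value;
Console.WriteLine( contentText );
```

Reading Value back from the parsed element returns the original text with its raw "<" and ">" intact, since the parser treats the CDATA section as plain character data.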
Upvotes: 0