Trowa

Reputation: 365

Replace some char within a string (XML format)

I was given a String variable with the following content:

<main>
<Title title="Hello World" />
<Content content="bla bla bla... by <1% to ??? on other bla bla...." />
</main>

This string will eventually be passed to a stored procedure for XQuery.

As you can see, the value of the "content" attribute contains the character "<", which causes an error when I try to parse the string in the stored procedure.

My question is how to convert that "<" into &lt; (in this case, <1% to &lt;1%) in an efficient way.

I want to keep the other "<" characters as they are.

Thanks

Upvotes: 0

Views: 114

Answers (2)

Dai

Reputation: 155503

Since you updated your question to point out that you are dealing with XML where the unencoded characters are in attribute values rather than #text nodes, the problem becomes somewhat simpler: extract the attribute value using an approach similar to my other answer, use a library function to entitize it, then write it back.

Note that CDATA only applies to #text, not attributes.

String doc =
@"<main>
<Title title=""Hello World"" />
<Content content=""bla bla bla... by <1% to ??? on other bla bla...."" />
</main>";

Int32 contentOpenStart = doc.IndexOf("<Content");
Int32 contentAttribContentValueStart = doc.IndexOf("content=\"", contentOpenStart) + "content=\"".Length;
Int32 contentAttribContentValueEnd   = doc.IndexOf("\"", contentAttribContentValueStart);

// Substring's second argument is a length, not an end index.
String attributeValueOld = doc.Substring( contentAttribContentValueStart, contentAttribContentValueEnd - contentAttribContentValueStart );
String attributeValueNew = System.Net.WebUtility.HtmlEncode( attributeValueOld );

String doc2 = String.Concat(
    doc.Substring( 0, contentAttribContentValueStart ),
    attributeValueNew,
    doc.Substring( contentAttribContentValueEnd )
);

doc2 then contains the fixed attribute value.

Note that HtmlEncode is not strictly the right tool for XML, as HTML defines many more entities than XML does. XML predefines only &amp;, &gt;, &lt;, &quot; and &apos;; all other characters should appear in the document as raw/native characters (or as numeric character references).
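If an XML-specific escaper is preferred, one option is System.Security.SecurityElement.Escape, which replaces only the five characters XML reserves. The following is a minimal sketch, not a definitive implementation. The class and method names are made up for illustration, and it assumes that the attribute value contains no embedded double quote and that nothing in it is already escaped (an existing &lt; or &amp; would be escaped a second time).

```csharp
using System;
using System.Security;

public static class AttributeFixer
{
    // Hypothetical helper: escapes the value of the first content="..."
    // attribute found after "<Content".
    // Assumption: the value contains no double quote and is not already escaped.
    public static string EscapeContentAttribute(string doc)
    {
        const string marker = "content=\"";

        int tagStart = doc.IndexOf("<Content", StringComparison.Ordinal);
        if (tagStart < 0) return doc;

        int markerStart = doc.IndexOf(marker, tagStart, StringComparison.Ordinal);
        if (markerStart < 0) return doc;

        int valueStart = markerStart + marker.Length;
        int valueEnd = doc.IndexOf('"', valueStart);
        if (valueEnd < 0) return doc;

        // Substring takes (startIndex, length), so pass the difference.
        string oldValue = doc.Substring(valueStart, valueEnd - valueStart);

        // Escapes only <, >, ", ' and &.
        string newValue = SecurityElement.Escape(oldValue);

        return doc.Substring(0, valueStart) + newValue + doc.Substring(valueEnd);
    }
}
```

Calling AttributeFixer.EscapeContentAttribute(doc) on the document above should turn "<1%" into "&lt;1%" inside the attribute and leave every other "<" untouched, after which a normal XML parser should accept the string.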

Upvotes: 1

Dai

Reputation: 155503

(This answer is based on the assumption you're dealing with structurally correct XML, just with unencoded entities in #text nodes - this answer does not apply if your input data really does look like <Title="foo" /> - which isn't XML at all)

If I understand your problem correctly, you have an XML document in a String instance which contains improperly escaped/entitized special characters, which prevents you from using a normal XML parser to read the document.

If you're dealing with an XML-compliant system, you can wrap the content of the <Content> element in a CDATA section (<![CDATA[ ... ]]>) so that it does not need to be processed or escaped at all. The trick then becomes inserting the CDATA delimiters.

While it's often said that one cannot use a regular expression to parse XML (as XML is not a regular language), you can take advantage of the grammatical rules of XML to extract and identify tags.

So if you have this:

<Content someAttribute="someValue">
reduce sales by <1% in order to ensure that profit > loss
</Content>

Then you can do this:

String doc = @"<main><Title...";
Int32 contentOpenStart = doc.IndexOf("<Content");
Int32 contentOpenEnd   = doc.IndexOf(">", contentOpenStart);

Int32 contentCloseStart = doc.IndexOf("</Content>", contentOpenEnd);

This code then tells us the locations of the angle brackets of the <Content> element's two tags, with which we can insert the CDATA delimiters:

String newDocument = String.Concat(
    doc.Substring( 0, contentOpenEnd + 1 ), // "<main>...<Content...>"
    "<![CDATA[",
    doc.Substring( contentOpenEnd + 1, contentCloseStart - ( contentOpenEnd + 1 ) ), // second argument is a length
    "]]>",
    doc.Substring( contentCloseStart ) // "</Content>..."
);

newDocument will then be this:

<Content someAttribute="someValue"><![CDATA[
reduce sales by <1% in order to ensure that profit > loss
]]></Content>

...which is valid XML.
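The steps above can be collected into one helper. This is only a sketch under stated assumptions: the class and method names are hypothetical, the opening <Content> tag must not contain ">" inside an attribute value, and the element text must contain neither "</Content>" nor "]]>" (the latter would end the CDATA section early).

```csharp
using System;

public static class CDataWrapper
{
    // Hypothetical helper: wraps the text of the first <Content> element
    // in a CDATA section so the text no longer needs escaping.
    public static string WrapContentInCData(string doc)
    {
        int openStart = doc.IndexOf("<Content", StringComparison.Ordinal);
        if (openStart < 0) return doc;

        // End of the opening tag (assumes no '>' inside its attribute values).
        int openEnd = doc.IndexOf('>', openStart);
        if (openEnd < 0) return doc;

        int closeStart = doc.IndexOf("</Content>", openEnd, StringComparison.Ordinal);
        if (closeStart < 0) return doc;

        int textStart = openEnd + 1;

        return string.Concat(
            doc.Substring(0, textStart),                      // "...<Content ...>"
            "<![CDATA[",
            doc.Substring(textStart, closeStart - textStart), // raw element text
            "]]>",
            doc.Substring(closeStart));                       // "</Content>..."
    }
}
```

After CDataWrapper.WrapContentInCData(doc), a parser such as System.Xml.Linq.XDocument.Parse should accept the result, and the element's value should be the original text with its "<" and ">" intact.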

Upvotes: 0
