Shed Simas

Reputation: 79

How do I unescape HTML, then transform it with XSLT?

I'm fairly new to XSLT, and I have a large XML document that I'm trying to transform into ICML (an XML variant used by Adobe InDesign). The relevant portion of the source document I'm working with looks something like this:

<BiographicalNote>
 &lt;p&gt;This text includes escaped HTML entities.&lt;/p&gt;
</BiographicalNote>

The XML itself is fine but the HTML it contains is escaped.

And here is a rough example of what I need the end product to look like:

<ParagraphStyleRange>
 <CharacterStyleRange>
  <Content>
   This text includes escaped HTML entities.
  </Content>
 </CharacterStyleRange>
</ParagraphStyleRange>

I can transform <BiographicalNote> to <ParagraphStyleRange><CharacterStyleRange><Content> no problem, but the escaped entities are stumping me. I can't seem to strip out the <p> tags.

Some important considerations:

My basic template looks like this:

<xsl:template match="BiographicalNote">
 <ParagraphStyleRange>
  <CharacterStyleRange>
   <Content>
   ...
   </Content>
  </CharacterStyleRange>
 </ParagraphStyleRange>
</xsl:template>

So it's what goes inside the <Content> tags I need to figure out. Here's what I've tried:

<xsl:call-template name="DescriptionParser">
 <xsl:with-param name="DescriptionText"><xsl:value-of select="." disable-output-escaping="yes" /></xsl:with-param>
</xsl:call-template>

<xsl:template name="DescriptionParser">
 <xsl:param name="DescriptionText" />
 <xsl:copy-of select="exsl:node-set($DescriptionText)/p" />
</xsl:template>

And:

<xsl:variable name="TaglineText"><xsl:value-of select="." disable-output-escaping="yes" /></xsl:variable>
<xsl:copy-of select="exsl:node-set($TaglineText)/p" />

Both of these yield an empty <Content> tag. Suspiciously, though, if select="exsl:node-set($TaglineText)", it works as expected and returns <p>This text includes escaped HTML entities.</p> with everything unescaped.

Also, using xsl:value-of instead of xsl:copy-of makes no difference when select="exsl:node-set($TaglineText)/p" (returns nothing); but when select="exsl:node-set($TaglineText)" it returns the original escaped HTML.

For some reason, it doesn't seem to recognize the <p> tag as a node, and therefore can't find it. Maybe disable-output-escaping isn't playing nice with exsl:node-set?

Can anyone tell me how to get the XSLT to recognize the <p> tags as nodes, or at the very least why this isn't working? I got most of the pieces to this puzzle from other StackOverflow topics, but I'm stumped on this bit.

Upvotes: 3

Views: 2538

Answers (1)

michael.hor257k

Reputation: 117073

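I am not sure what your question is. Escaped text is not XML and cannot be processed as XML. The disable-output-escaping attribute only affects how a text node is serialized to the output; inside your variable the content is still a single text node holding the characters of the markup, so exsl:node-set($TaglineText)/p has no p element to select. There are no nodes you can select, so the best you can hope for is a result of: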

<Content>
<p>This text includes escaped HTML entities.</p>
</Content>

which is easy to get using:

<Content>
    <xsl:value-of select="." disable-output-escaping="yes"/>
</Content>

If you want to remove the wrapping element, you must do so using string functions. If you can be sure that the wrapping element is <p> (or any other tag whose name is one character long), you can skip the 3-character opening tag and drop the 4-character closing tag:

<Content>
    <xsl:variable name="text" select="normalize-space(.)" />
    <xsl:value-of select="substring($text, 4, string-length($text) - 7)" disable-output-escaping="yes"/>
</Content>
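
If the name of the wrapping tag is not known in advance, the same string-function idea can be sketched with substring-after and substring-before. This is only a minimal sketch: it assumes each BiographicalNote holds exactly one wrapping element with no nested markup inside it, and it reuses the element names from the question's template.

```xml
<xsl:template match="BiographicalNote">
  <ParagraphStyleRange>
    <CharacterStyleRange>
      <Content>
        <xsl:variable name="text" select="normalize-space(.)"/>
        <!-- Drop everything up to and including the first '>' (the opening tag). -->
        <xsl:variable name="inner" select="substring-after($text, '&gt;')"/>
        <!-- Keep everything before the first '</'.
             Assumption: no nested elements inside the wrapper. -->
        <xsl:value-of select="substring-before($inner, '&lt;/')" disable-output-escaping="yes"/>
      </Content>
    </CharacterStyleRange>
  </ParagraphStyleRange>
</xsl:template>
```

With nested tags such as <em> inside the paragraph, substring-before would cut the text at the first closing tag, so this approach only suits flat content.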

Alternatively, save the result of this transformation to a file, and process the resulting file. However, this requires that the resulting file be a well-formed XML document - I understand you cannot be sure of that.
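
As a rough illustration of that second pass, the stylesheet below assumes the first pass wrote well-formed output in which <Content> contains real <p> elements, and that the elements are in no namespace. It copies everything unchanged and only unwraps <p>.

```xml
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Identity template: copy every node and attribute unchanged by default. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Unwrap <p> inside <Content>: keep its children, drop the element itself. -->
  <xsl:template match="Content/p">
    <xsl:apply-templates/>
  </xsl:template>

</xsl:stylesheet>
```

Because the <p> elements are real nodes in the second pass, ordinary match patterns work and no string manipulation is needed.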

Upvotes: 3
