atif

Reputation: 1147

parsing string in xslt

I have the following XML:

<xml>
<xref>
 is determined &ldquo;in prescribed manner&rdquo;
</xref>
</xml>

I would like to process this with XSLT 2.0 and produce the following result:

<xml>
<xref>
   is
</xref>
 <xref>
   determined
</xref>
 <xref>
   &ldquo;in prescribed manner&rdquo;
</xref>
</xml>

I tried a few options, such as replacing the spaces and entities and then using a for-each loop, but I could not get it to work. Maybe the tokenize function of XSLT 2.0 can be used, but I don't know how to use it. Any hint would be helpful.

Upvotes: 2

Views: 1044

Answers (1)

Marcus Rickert

Reputation: 4238

@JimGarrison: Sorry, I couldn't resist. :-) This XSLT is definitely not elegant, but it does (I assume) most of the job:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet 
    version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes" />

  <xsl:variable name="left_quote" select="'&lt;'"/>
  <xsl:variable name="right_quote" select="'&gt;'"/>

  <xsl:template name="protected_tokenize">
    <xsl:param name="string"/>

    <xsl:variable name="pattern" select="concat('^([^', $left_quote, ']+)(', $left_quote, '[^', $right_quote, ']*', $right_quote,')?(.*)')"/>

    <!-- Prepend a space so that the prefix group always has at least one character to match, even when the string starts with a protected token. -->
    <xsl:analyze-string select="concat(' ', normalize-space($string))" regex="{$pattern}">
      <xsl:matching-substring>

        <!-- Handle the prefix of the string up to the first opening quote by "normal" tokenizing. -->
        <xsl:variable name="prefix" select="normalize-space(regex-group(1))"/>
        <xsl:for-each select="tokenize($prefix, ' ')">
          <xref>
            <xsl:value-of select="."/>
          </xref>
        </xsl:for-each>

        <!-- Handle the text between the quotes by simply passing it through. -->
        <xsl:variable name="protected_token" select="normalize-space(regex-group(2))"/>
        <xsl:if test="$protected_token != ''">
          <xref>
            <xsl:value-of select="$protected_token"/>
          </xref>
        </xsl:if>

        <!-- Handle the suffix of the string. This part may contain protected tokens again, so we process it recursively. -->
        <xsl:variable name="suffix" select="normalize-space(regex-group(3))"/>
        <xsl:if test="$suffix != ''">
          <xsl:call-template name="protected_tokenize">
            <xsl:with-param name="string" select="$suffix"/>
          </xsl:call-template>
        </xsl:if>

      </xsl:matching-substring>
    </xsl:analyze-string>
  </xsl:template>

  <xsl:template match="*|@*">
    <xsl:copy>
      <xsl:apply-templates/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="xref">
    <xsl:call-template name="protected_tokenize">
      <xsl:with-param name="string" select="text()"/>
    </xsl:call-template>
  </xsl:template>

</xsl:stylesheet>

Notes:

  • There is the general assumption that white space only serves as a token delimiter and need not be preserved.
  • &ldquo; and &rdquo; are not valid in plain XML, although they are valid in HTML: XML predefines only &lt;, &gt;, &amp;, &apos; and &quot;, so any other entity has to be declared in a DTD. In the XSLT, two variables hold the quote characters (currently < and > as placeholders). They will have to be adapted once you settle on the right XML representation, for example the numeric character references &#x201C; and &#x201D;. You can also eliminate the variables and put the characters directly into the regular expression pattern, which simplifies it considerably.
  • <xsl:analyze-string> does not allow a regular expression that can match a zero-length string. This is a small problem, since the prefix, the protected token, and the suffix may each be empty. I work around it by artificially prepending a space, which lets me match the prefix using + (at least one occurrence) instead of * (zero or more occurrences).
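
For comparison, here is a minimal sketch of a different, non-recursive approach (not the stylesheet above): the quote characters are written directly into the pattern, and a single alternation matches either a complete quoted phrase or a run of non-whitespace characters. It assumes the input writes the quotes as the numeric character references &#x201C; and &#x201D; instead of &ldquo; and &rdquo;, and that quoted phrases are not nested. Because both alternatives require at least one character, the pattern can never match a zero-length string.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet
    version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/>

  <!-- Copy every node that is not handled by a more specific template. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Split the text of each xref into tokens and wrap each token in its own xref. -->
  <xsl:template match="xref">
    <!-- The first alternative matches a whole quoted phrase (assumed to be
         delimited by U+201C and U+201D); the second matches a plain word.
         When both could match at the same position, the first one is chosen. -->
    <xsl:analyze-string select="string(.)"
                        regex="&#x201C;[^&#x201D;]*&#x201D;|\S+">
      <xsl:matching-substring>
        <xref>
          <xsl:value-of select="."/>
        </xref>
      </xsl:matching-substring>
      <!-- Whitespace between tokens is a non-matching substring and is dropped. -->
    </xsl:analyze-string>
  </xsl:template>

</xsl:stylesheet>
```

Applied to an input whose xref contains "is determined &#x201C;in prescribed manner&#x201D;", this should yield three xref elements: one for "is", one for "determined", and one for the whole quoted phrase. An opening quote without a matching closing quote falls through to the second alternative and is split like ordinary words.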

Upvotes: 1
