haggis78

Reputation: 73

How can I use XSLT to locate and tag any of a large number of strings?

I have an XML file that I am transforming to HTML using XSLT 3.0 in oXygen.

Let's say that my input file looks like this:

<root>
<p>First I spoke to John Smith.</p>
<p>Then I talked with David Jones.</p>
</root>

Further, I have a list of terms that I want to tag automatically as part of the transformation. These are in a separate XML file like so:

<terms>
<term>spoke</term>
<term>talked</term>
</terms>

And I want my output HTML to look like this:

<body>
<p>First I <span class="term">spoke</span> to John Smith.</p>
<p>Then I <span class="term">talked</span> with David Jones.</p>
</body>

Naturally this could be accomplished with a regex search and replace, but I am matching a list of several hundred terms against a book-length text, so handling them one at a time is out of the question. I assume there must be an automated way to do this in my XSLT.

In my head one way it might work is using <xsl:analyze-string> something like this, except that instead of a single Regex search, I need to have it loop over all of the elements in the other XML file:

<xsl:template match="text()">
    <xsl:analyze-string select="." regex="findmywords">
        <xsl:matching-substring>
            <span class="term">
                <xsl:value-of select="."/>
            </span>
        </xsl:matching-substring>
        <xsl:non-matching-substring>
            <xsl:value-of select="."/>
        </xsl:non-matching-substring>
    </xsl:analyze-string>
</xsl:template>

Or is there some way to start with xsl:for-each on the terms list and feed each term into a replace() function? I'm not sure how to structure that so that it affects the text output from the book XML.

Any direction would be appreciated. Sorry for the ignorance; I'm still learning.

Upvotes: 1

Views: 51

Answers (2)

Michael Kay

Reputation: 163595

Can you tokenize first before doing the matching? That is, would you want to change bespoke to be<span class="term">spoke</span>? I'm assuming not.

I would suggest something like the following, adapting the regex as needed to match your tokens:

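<xsl:template match="text()">
  <xsl:analyze-string select="." regex="[-A-Za-z]+">
    <!-- xsl:sequence rather than a {...} text value template here: a text value
         template would flatten the returned span element to its string value -->
    <xsl:matching-substring>
      <xsl:sequence select="f:substitute(.)"/>
    </xsl:matching-substring>
    <xsl:non-matching-substring>{.}</xsl:non-matching-substring>
  </xsl:analyze-string>
</xsl:template>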
        
<xsl:function name="f:substitute" as="item()">
  <xsl:param name="token" as="xs:string"/>
  <xsl:choose>
     <xsl:when test="key('tokenlist', $token, $tokendoc)">
        <span class="term">{$token}</span>
     </xsl:when>
     <xsl:otherwise>{$token}</xsl:otherwise>
  </xsl:choose>
</xsl:function>

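<xsl:key name="tokenlist" match="term" use="."/>

<!-- Assumption: the terms file is called terms.xml and sits next to the stylesheet.
     The {...} text value templates also need expand-text="yes" on xsl:stylesheet. -->
<xsl:variable name="tokendoc" select="doc('terms.xml')"/>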

I've chosen to use analyze-string here rather than tokenize because it preserves the separators (punctuation) between your tokens.
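
To show how the pieces above might fit together, here is a minimal sketch of a complete stylesheet. Several details are assumptions rather than anything stated in the question: the terms file is taken to be terms.xml next to the stylesheet, the f prefix is bound to a made-up namespace URI, and a small template turns root into body. The matched token is passed to the function with xsl:sequence so that the span element it returns is kept as an element.

```xml
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:f="http://example.com/local-functions"
    exclude-result-prefixes="xs f"
    expand-text="yes">

  <!-- Assumed file name; a relative URI is resolved against the stylesheet location -->
  <xsl:variable name="tokendoc" select="doc('terms.xml')"/>

  <!-- Index the terms by their string value for fast lookup -->
  <xsl:key name="tokenlist" match="term" use="."/>

  <!-- Copy anything not handled by a more specific template -->
  <xsl:mode on-no-match="shallow-copy"/>

  <!-- Assumed mapping of the input root element to the HTML body -->
  <xsl:template match="root">
    <body>
      <xsl:apply-templates/>
    </body>
  </xsl:template>

  <!-- Split each text node into word tokens and separators -->
  <xsl:template match="text()">
    <xsl:analyze-string select="." regex="[-A-Za-z]+">
      <xsl:matching-substring>
        <xsl:sequence select="f:substitute(.)"/>
      </xsl:matching-substring>
      <xsl:non-matching-substring>{.}</xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>

  <!-- Wrap the token in a span if it is in the terms list, else return it as text -->
  <xsl:function name="f:substitute" as="item()">
    <xsl:param name="token" as="xs:string"/>
    <xsl:choose>
      <xsl:when test="key('tokenlist', $token, $tokendoc)">
        <span class="term">{$token}</span>
      </xsl:when>
      <xsl:otherwise>{$token}</xsl:otherwise>
    </xsl:choose>
  </xsl:function>

</xsl:stylesheet>
```

Because the lookup compares whole tokens, "bespoke" is left alone while "spoke" is tagged. The comparison is case-sensitive, so a capitalised "Spoke" at the start of a sentence would not match unless both sides are normalised, for example with lower-case().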

Upvotes: 3

Martin Honnen

Reputation: 167716

Sounds like you want regex="{string-join(doc('terms.xml')//term, '|')}". If any term can contain special characters that would affect regular expression matching, use //term/functx:escape-for-regex(.) from the FunctX library: https://www.datypic.com/xsl/functx_escape-for-regex.html
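
As a rough sketch of how that attribute might be used in a full stylesheet, the example below builds the alternation once in a global variable. It assumes the terms file is terms.xml next to the stylesheet and that it contains at least one term (an empty regex would make xsl:analyze-string fail). The variable names are made up for illustration, and the escaping is done with an inline replace() that stands in for functx:escape-for-regex so the example does not depend on importing FunctX.

```xml
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs"
    expand-text="yes">

  <!-- Assumed file name; every term as a plain string -->
  <xsl:variable name="terms" as="xs:string*"
      select="doc('terms.xml')//term/string()"/>

  <!-- Backslash-escape regex metacharacters in each term, then join with | -->
  <xsl:variable name="term-regex" as="xs:string"
      select="string-join(
                $terms ! replace(., '(\.|\[|\]|\\|\||\-|\^|\$|\?|\*|\+|\{|\}|\(|\))', '\\$1'),
                '|')"/>

  <!-- Copy anything not handled by a more specific template -->
  <xsl:mode on-no-match="shallow-copy"/>

  <!-- Tag every occurrence of any term inside a text node -->
  <xsl:template match="text()">
    <xsl:analyze-string select="." regex="{$term-regex}">
      <xsl:matching-substring>
        <span class="term">{.}</span>
      </xsl:matching-substring>
      <xsl:non-matching-substring>{.}</xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>

</xsl:stylesheet>
```

Note that a plain alternation matches anywhere in the text, so "spoke" would also be tagged inside "bespoke". XPath regular expressions have no \b word-boundary anchor, so if whole-word matching matters, tokenizing first and looking each token up is probably the simpler route. If one term is a prefix of another, sorting the longer terms first keeps the shorter alternative from winning.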

Upvotes: 1
