Reputation: 73
I have an XML file that I am transforming to HTML using XSLT 3.0 in oXygen.
Let's say that my input file looks like this:
<root>
<p>First I spoke to John Smith.</p>
<p>Then I talked with David Jones.</p>
</root>
Further, I have a list of terms that I want to tag automatically as part of the transformation. These are in a separate XML file like so:
<terms>
<term>spoke</term>
<term>talked</term>
</terms>
And I want my output HTML to look like this:
<body>
<p>First I <span class="term">spoke</span> to John Smith.</p>
<p>Then I <span class="term">talked</span> with David Jones.</p>
</body>
Naturally this could be accomplished with Regex search and replace, but I am collating a list of several hundred terms with a book-length text, so doing them one at a time is out of the question. I assume there must be an automated way to do this in my XSLT.
In my head, one way it might work is with <xsl:analyze-string>, something like this, except that instead of a single regex search, I need it to loop over all of the elements in the other XML file:
<xsl:template match="text()">
<xsl:analyze-string select="." regex="findmywords">
<xsl:matching-substring>
<span class="term">
<xsl:value-of select="."/>
</span>
</xsl:matching-substring>
<xsl:non-matching-substring>
<xsl:value-of select="."/>
</xsl:non-matching-substring>
</xsl:analyze-string>
</xsl:template>
Or is there some way to start with xsl:for-each on the terms list and feed each term into a replace() function? I'm not sure how to structure that so that it affects the text output from the book XML.
Any direction would be appreciated. Sorry for the ignorance; I'm still learning.
Upvotes: 1
Views: 51
Reputation: 163595
Can you tokenize first before doing the matching? That is, would you want to change bespoke to be<span class="term">spoke</span>? I'm assuming not.
I would suggest something like the following, adapting the regex as needed to match your tokens:
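<!-- Assumes expand-text="yes" on xsl:stylesheet and declared f: and xs: prefixes. -->
<!-- $tokendoc must hold the terms document; the file name terms.xml is an assumption. -->
<xsl:variable name="tokendoc" select="doc('terms.xml')"/>
<xsl:template match="text()">
<xsl:analyze-string select="." regex="[-A-Za-z]+">
<!-- xsl:sequence keeps the span element; a text value template would reduce it to a string -->
<xsl:matching-substring><xsl:sequence select="f:substitute(.)"/></xsl:matching-substring>
<xsl:non-matching-substring>{.}</xsl:non-matching-substring>
</xsl:analyze-string>
</xsl:template>
<!-- Wraps the token in a span if it is in the term list, otherwise returns it as text -->
<xsl:function name="f:substitute" as="item()">
<xsl:param name="token" as="xs:string"/>
<xsl:choose>
<xsl:when test="key('tokenlist', $token, $tokendoc)">
<span class="term">{$token}</span>
</xsl:when>
<xsl:otherwise>{$token}</xsl:otherwise>
</xsl:choose>
</xsl:function>
<xsl:key name="tokenlist" match="term" use="."/>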
I've chosen to use analyze-string here rather than tokenize because it preserves the separators (punctuation) between your tokens.
Upvotes: 3
Reputation: 167716
Sounds like you want regex="{string-join(doc('terms.xml')//term, '|')}". If any term can contain special characters that would affect regular expression matching, use //term/functx:escape-for-regex(.) from the FunctX library: https://www.datypic.com/xsl/functx_escape-for-regex.html
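As a minimal sketch of how that could fit into a complete stylesheet: the file name terms.xml, the variable names terms and term-regex, and the root-to-body mapping are assumptions taken from the question's example, not a definitive implementation. The escaping is inlined with the same pattern that functx:escape-for-regex uses, so the sketch does not depend on importing FunctX.

```xml
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs"
    expand-text="yes">

  <!-- Copy everything that has no more specific template -->
  <xsl:mode on-no-match="shallow-copy"/>

  <!-- Assumption: the term list lives in terms.xml next to this stylesheet
       and contains at least one term -->
  <xsl:variable name="terms" as="xs:string*"
      select="doc('terms.xml')//term/string()"/>

  <!-- Escape regex metacharacters in each term, then join the terms
       into one alternation such as spoke|talked -->
  <xsl:variable name="term-regex" as="xs:string"
      select="string-join(
                $terms ! replace(., '(\.|\[|\]|\\|\||\-|\^|\$|\?|\*|\+|\{|\}|\(|\))', '\\$1'),
                '|')"/>

  <!-- Assumption: the input root element maps to body, as in the question -->
  <xsl:template match="root">
    <body>
      <xsl:apply-templates/>
    </body>
  </xsl:template>

  <!-- Wrap every occurrence of a term in a span; pass other text through -->
  <xsl:template match="text()">
    <xsl:analyze-string select="." regex="{$term-regex}">
      <xsl:matching-substring>
        <span class="term">{.}</span>
      </xsl:matching-substring>
      <xsl:non-matching-substring>{.}</xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>

</xsl:stylesheet>
```

Note that a plain alternation also matches inside longer words, so spoke would be tagged within bespoke. If that is not wanted, the tokenizing approach with a key lookup avoids it.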
Upvotes: 1