Reputation: 97
I have an element <mixed> that contains mixed content. Is it possible to use XSLT (2.0) to wrap all “words” (delimited by the pattern \s+, for example) inside <mixed> in a <w> tag, descending into inline elements when necessary? For example, given the following input:
<mixed>
One morning, when <a>Gregor Samsa</a>
woke from troubled dreams, he found
himself transformed in his bed into
a <b><c>horrible vermin</c></b>.
</mixed>
I want something like the following output:
<mixed>
<w>One</w> <w>morning,</w> <w>when</w> <a><w>Gregor</w> <w>Samsa</w></a>
<w>woke</w> <w>from</w> <w>troubled</w> <w>dreams,</w> <w>he</w> <w>found</w>
<w>himself</w> <w>transformed</w> <w>in</w> <w>his</w> <w>bed</w> <w>into</w>
<w>a</w> <b><c><w>horrible</w></c></b> <w><b><c>vermin</c></b>.</w>
</mixed>
Dimitre Novatchev provided a template in an answer to this related question that goes much of the way to solving this, but does not satisfy the following requirements:
Inline elements that terminate within a “word” should be split so that a single <w> element contains the whole “word.” Otherwise there would be invalid XML, such as:
<w>a</w> <w><b><c>horrible</w> <w>vermin</c></b>.</w>
However, this template detaches the punctuation . after vermin and produces:
<w>a</w> <b><c><w>horrible</w> <w>vermin</w></c></b> <w>.</w>
(Edit: None of the current 3 answers satisfy this requirement.)
The split token must not be discarded. Consider the similar task of wrapping non-coefficient numbers in <sub> tags in the context of a chemical formula. For example, <reactants>2H2 + O2</reactants> becomes <reactants>2H<sub>2</sub> + O<sub>2</sub></reactants>. This is not possible using the tokenize function because it simply discards the separator. Instead we will probably have to fall back on analyze-string.
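For the chemical-formula case, here is a minimal sketch of what the analyze-string approach might look like. It assumes that a "non-coefficient number" is simply a run of digits directly preceded by a letter, and the regex and the reactants//text() match pattern are my own illustrative choices rather than anything taken from an existing answer. Because the letter is captured in group 1 and written back out, nothing from the input is discarded.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- identity transform: copy everything not handled below -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Assumption: a subscript is any digit run that directly follows a letter -->
  <xsl:template match="reactants//text()">
    <xsl:analyze-string select="." regex="([A-Za-z])([0-9]+)">
      <xsl:matching-substring>
        <!-- re-emit the letter, then wrap only the digits -->
        <xsl:value-of select="regex-group(1)"/>
        <sub><xsl:value-of select="regex-group(2)"/></sub>
      </xsl:matching-substring>
      <xsl:non-matching-substring>
        <!-- coefficients, operators and whitespace pass through unchanged -->
        <xsl:value-of select="."/>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>

</xsl:stylesheet>
```

With <reactants>2H2 + O2</reactants> as input, the leading 2 is not preceded by a letter, so it falls into the non-matching branch and stays as plain text. This sketch does not try to handle parenthesised groups or charges, and it does not address the harder problem of “words” that span element boundaries.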
If not XSLT, what is the best method to do this?
Upvotes: 1
Views: 869
Reputation: 70598
How about this XSLT, which has an extra template to cope with elements that are immediately followed by a text node containing only a full stop.
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" omit-xml-declaration="yes" indent="yes"/>
  <xsl:strip-space elements="*"/>
  <xsl:template match="node()|@*">
    <xsl:copy>
      <xsl:apply-templates select="node()|@*"/>
    </xsl:copy>
  </xsl:template>
  <xsl:template match="text()">
    <xsl:for-each select="tokenize(., '[\s]')[.]">
      <w><xsl:sequence select="."/></w>
    </xsl:for-each>
  </xsl:template>
  <xsl:template match="text()[normalize-space() = '.']" />
  <xsl:template match="node()[following-sibling::node()[1][self::text()][normalize-space() = '.']]">
    <w>
      <xsl:copy>
        <xsl:apply-templates select="node()|@*"/>
      </xsl:copy>
      <xsl:text>.</xsl:text>
    </w>
  </xsl:template>
</xsl:stylesheet>
Upvotes: 0
Reputation: 116959
AFAICT, this would provide the expected result in your example:
XSLT 2.0
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" version="1.0" encoding="UTF-8" indent="no"/>
  <xsl:strip-space elements="*"/>
  <!-- identity transform -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
  <xsl:template match="text()[ancestor::mixed]">
    <xsl:analyze-string select="." regex="\s+">
      <xsl:matching-substring>
        <xsl:value-of select="." />
      </xsl:matching-substring>
      <xsl:non-matching-substring>
        <w>
          <xsl:value-of select="." />
        </w>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>
</xsl:stylesheet>
However, I did not understand your point regarding 'Inline elements that terminate within a “word”'. What would be the expected result when, for example, only part of a word is italicized?
Upvotes: 1
Reputation: 167401
If you use analyze-string on \S+ with
<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
  <xsl:template match="mixed//text()">
    <xsl:analyze-string select="." regex="\S+">
      <xsl:matching-substring>
        <w>
          <xsl:value-of select="."/>
        </w>
      </xsl:matching-substring>
      <xsl:non-matching-substring>
        <xsl:value-of select="."/>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>
</xsl:transform>
you get
<mixed>
<w>One</w> <w>morning,</w> <w>when</w> <a><w>Gregor</w> <w>Samsa</w></a>
<w>woke</w> <w>from</w> <w>troubled</w> <w>dreams,</w> <w>he</w> <w>found</w>
<w>himself</w> <w>transformed</w> <w>in</w> <w>his</w> <w>bed</w> <w>into</w>
<w>a</w> <b><c><w>horrible</w> <w>vermin</w></c></b><w>.</w>
</mixed>
Do you really want to join the trailing dot with the preceding vermin that is inside of your inline elements?
Upvotes: 0