Reputation: 310
I have the feeling there's an obvious solution out there, but I can't think of it. Using XSLT 2.0, I want to tokenize a string that's distributed across child elements, something like this:
<line>
<font style="big">
<text color="blue">wha</text>
</font>
<font style="small">
<text color="red">t is o</text>
</font>
<font style="small">
<text color="blue">n </text>
</font>
<font style="small">
<text color="blue">his </text>
</font>
<font style="small">
<text color="blue">mind.</text>
</font>
</line>
I would like to tokenize the value of the string, i.e., split the string on blanks and punctuation marks, but still keep each segment in its tree structure. So what I want to get:
<line>
<token>
<font style="big">
<text color="blue">wha</text>
</font>
<font style="small">
<text color="red">t</text>
</font>
</token>
<token>
<font style="small">
<text color="red">is</text>
</font>
</token>
<token>
<font style="small">
<text color="red">o</text>
</font>
<font style="small">
<text color="blue">n</text>
</font>
</token>
<token>
<font style="small">
<text color="blue">his</text>
</font>
</token>
<token>
<font style="small">
<text color="blue">mind</text>
</font>
</token>
<token>
<font style="small">
<text color="blue">.</text>
</font>
</token>
</line>
I.e., move every word and punctuation mark into a separate token element. With just a plain string that's easy, and I could use either analyze-string or matches(), but I can't find an elegant and robust solution when the string is spread across several elements.
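For comparison, here is a minimal sketch of the plain-string case I mean. It assumes a hypothetical flat input such as <line>what is on his mind.</line> (no font or text children), and the punctuation set in the regex is just an illustrative choice:

```xml
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output indent="yes"/>
  <!-- Assumed input: a line element containing only text -->
  <xsl:template match="line">
    <line>
      <!-- \w+ matches one word; [.,;?!] matches one punctuation mark -->
      <xsl:analyze-string select="string(.)" regex="\w+|[.,;?!]">
        <xsl:matching-substring>
          <token><xsl:value-of select="."/></token>
        </xsl:matching-substring>
        <!-- No non-matching-substring branch: whitespace is simply dropped -->
      </xsl:analyze-string>
    </line>
  </xsl:template>
</xsl:stylesheet>
```

Each word and the final full stop end up in their own token element, but of course all the font and text markup is gone, which is exactly the part I can't solve.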
I'll be thrilled to hear your ideas, Ruprecht
Upvotes: 1
Views: 583
Reputation: 5652
This does half the job: it tokenises the strings, but it doesn't add your <token> markup because, if I understand it correctly, that would require a dictionary lookup to recognise words. It produces
<line>
<font style="big">
<text color="blue">wha</text>
</font>
<font style="small">
<text color="red">t</text>
</font>
<font style="small">
<text color="red">is</text>
</font>
<font style="small">
<text color="red">o</text>
</font>
<font style="small">
<text color="blue">n</text>
</font>
<font style="small">
<text color="blue">his</text>
</font>
<font style="small">
<text color="blue">mind.</text>
</font>
</line>
stylesheet:
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:strip-space elements="*"/>
<xsl:output indent="yes"/>
<xsl:template match="*">
<xsl:copy>
<xsl:copy-of select="@*"/>
<xsl:apply-templates/>
</xsl:copy>
</xsl:template>
<xsl:template match="font">
<xsl:variable name="fa" select="@*"/>
<xsl:for-each select="text">
<xsl:variable name="ta" select="@*"/>
<xsl:for-each select="text()/tokenize(.,'\s+')[.]">
<font>
<xsl:copy-of select="$fa"/>
<text>
<xsl:copy-of select="$ta"/>
<xsl:value-of select="."/>
</text>
</font>
</xsl:for-each>
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>
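OK, updated after the clarification in the comments. The font template now emits an empty <tok/> marker at every token boundary (whitespace or punctuation), and the template for the parent element groups the result on those markers with for-each-group. It now generates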
<line>
<token>
<font style="big">
<text color="blue">wha</text>
</font>
<font style="small">
<text color="red">t</text>
</font>
</token>
<token>
<font style="small">
<text color="red">is</text>
</font>
</token>
<token>
<font style="small">
<text color="red">o</text>
</font>
<font style="small">
<text color="blue">n</text>
</font>
</token>
<token>
<font style="small">
<text color="blue">his</text>
</font>
</token>
<token>
<font style="small">
<text color="blue">mind</text>
</font>
</token>
<token>
<font style="small">
<text color="blue">.</text>
</font>
</token>
</line>
xslt:
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:strip-space elements="*"/>
<xsl:output indent="yes"/>
<xsl:template match="*">
<xsl:copy>
<xsl:copy-of select="@*"/>
<xsl:apply-templates/>
</xsl:copy>
</xsl:template>
<xsl:template match="*[font]">
<xsl:copy>
<xsl:copy-of select="@*"/>
<xsl:variable name="p1">
<xsl:apply-templates/>
</xsl:variable>
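<xsl:for-each-group select="$p1/*" group-starting-with="tok">
<!-- skip groups that contain only a <tok/> marker, e.g. after two adjacent boundaries -->
<xsl:if test="current-group()[not(self::tok)]">
<token>
<xsl:copy-of select="current-group() except self::tok"/>
</token>
</xsl:if>
</xsl:for-each-group>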
</xsl:copy>
</xsl:template>
<xsl:template match="font">
<xsl:variable name="fa" select="@*"/>
<xsl:for-each select="text">
<xsl:variable name="ta" select="@*"/>
<xsl:if test="position()=1 and matches(.,'^\s')"><tok/></xsl:if>
<xsl:for-each select="text()/tokenize(.,'\s+')[.]">
<xsl:if test="position()!=1"><tok/></xsl:if>
<xsl:analyze-string regex="[.,;?]" select=".">
<xsl:matching-substring>
<tok/>
<font>
<xsl:copy-of select="$fa"/>
<text>
<xsl:copy-of select="$ta"/>
<xsl:value-of select="."/>
</text>
</font>
</xsl:matching-substring>
<xsl:non-matching-substring>
<font>
<xsl:copy-of select="$fa"/>
<text>
<xsl:copy-of select="$ta"/>
<xsl:value-of select="."/>
</text>
</font>
</xsl:non-matching-substring>
</xsl:analyze-string>
</xsl:for-each>
<xsl:if test="position()=last() and matches(.,'\s$')"><tok/></xsl:if>
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>
Upvotes: 2