Reputation: 191
Is it possible to use a regex to search XML content using XSLT? I can search nodes using contains(), but I need word boundaries (e.g. /\bmy phrase\b/i) to search for a whole phrase and not just individual words.
When searching for 'blood pressure' using the following, all nodes with 'blood', 'pressure' and 'blood pressure' are returned.
I only want nodes containing 'blood pressure' to be returned. With PHP's preg_match, I can achieve this using: /\b$keywords\b/i
<xsl:template match="//item">
  <xsl:choose>
    <xsl:when test="contains(translate(title, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), $keyword) or contains(translate(content, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), $keyword)">
      <item>
        <title><xsl:value-of select="title"/></title>
        <content><xsl:value-of select="content"/></content>
        <date><xsl:value-of select="date"/></date>
        <author><xsl:value-of select="author"/></author>
      </item>
    </xsl:when>
  </xsl:choose>
</xsl:template>
Upvotes: 3
Views: 1160
Reputation: 243459
I. You may do something like this in XSLT 2.0:
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output omit-xml-declaration="yes" indent="yes"/>

  <xsl:template match="s">
    <xsl:variable name="vWords" select=
      "tokenize(lower-case(string(.)),
                '[\s.?!,;—:\-]+'
               ) [.]
      "/>
    <xsl:sequence select=
      "for $current in .,
           $i in 1 to count($vWords)
         return
           if ($vWords[$i] eq 'blood'
               and
               $vWords[$i+1] eq 'pressure'
              )
             then .
             else ()
      "/>
  </xsl:template>

  <xsl:template match="text()"/>
</xsl:stylesheet>
When this XSLT 2.0 transformation is applied to the following XML document (none was provided in the question):
<t>
<s>He has high blood pressure.</s>
<s>He has high Blood Pressure.</s>
<s>He has high Blood
Pressure.</s>
<s>He was coldblood Pressured.</s>
</t>
the wanted, correct result (only elements containing "blood" and "pressure", case-insensitively and as two adjacent words) is produced:
<s>He has high blood pressure.</s>
<s>He has high Blood Pressure.</s>
<s>He has high Blood
Pressure.</s>
Explanation:
Using the tokenize() function to split the lower-cased string on runs of whitespace and punctuation characters.
Iterating through the result of tokenize() to find a "blood" word followed immediately by a "pressure" word.
II. An XSLT 1.0 solution:
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output omit-xml-declaration="yes" indent="yes"/>
  <xsl:strip-space elements="*"/>

  <xsl:variable name="vUpper" select=
    "'ABCDEFGHIJKLMNOPQRSTUVWXYZ'"/>
  <xsl:variable name="vLower" select=
    "'abcdefghijklmnopqrstuvwxyz'"/>
  <!-- A long run of spaces: translate() deletes any non-letter character
       that has no counterpart here, so this string must not be too short -->
  <xsl:variable name="vSpaaaceeees" select=
    "'                                                            '
    "/>
  <xsl:variable name="vAlpha" select="concat($vLower, $vUpper)"/>

  <xsl:template match="s">
    <xsl:variable name="vallLower" select="translate(., $vUpper, $vLower)"/>
    <xsl:copy-of select=
      "self::*
         [contains
            (concat
               (' ',
                normalize-space
                  (translate($vallLower, translate($vallLower, $vAlpha, ''), $vSpaaaceeees)),
                ' '
               ),
             ' blood pressure '
            )
         ]
      "/>
  </xsl:template>

  <xsl:template match="text()"/>
</xsl:stylesheet>
When this transformation is applied to the same XML document (above), the same correct result is produced:
<s>He has high blood pressure.</s>
<s>He has high Blood Pressure.</s>
<s>He has high Blood
Pressure.</s>
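Explanation:
Converting to lowercase.
Using the double-translate method to replace every non-alphabetic character with a space. The string of spaces must be at least as long as the number of non-alphabetic characters in the text, because translate() deletes any character that has no counterpart in its third argument.
Then using normalize-space() to replace any group of adjacent spaces with a single space.
Then surrounding this result with spaces.
Finally, checking whether the result contains the string " blood pressure ".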
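The same steps could be adapted to the item elements and the $keyword parameter from the question. This is only a sketch, assuming XSLT 1.0 and a keyword made up of letters and spaces; the names vSpaces and vPhrase are made up for illustration, and vSpaces is subject to the length limitation of the double-translate method (it needs at least one space per non-letter character in the searched text).

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output omit-xml-declaration="yes" indent="yes"/>

  <!-- Assumed parameter, as in the question: letters and spaces only -->
  <xsl:param name="keyword" select="'blood pressure'"/>

  <xsl:variable name="vUpper" select="'ABCDEFGHIJKLMNOPQRSTUVWXYZ'"/>
  <xsl:variable name="vLower" select="'abcdefghijklmnopqrstuvwxyz'"/>
  <xsl:variable name="vAlpha" select="concat($vLower, $vUpper)"/>
  <!-- Illustrative: must hold at least one space per non-letter character in the text -->
  <xsl:variable name="vSpaces"
      select="'                                                                                '"/>
  <!-- Lower-cased phrase padded with spaces so that only whole words match -->
  <xsl:variable name="vPhrase"
      select="concat(' ', normalize-space(translate($keyword, $vUpper, $vLower)), ' ')"/>

  <xsl:template match="item">
    <!-- title and content are tested separately so a phrase cannot span both -->
    <xsl:if test="(title | content)
                    [contains(concat(' ',
                                     normalize-space(
                                       translate(translate(., $vUpper, $vLower),
                                                 translate(., $vAlpha, ''),
                                                 $vSpaces)),
                                     ' '),
                              $vPhrase)]">
      <xsl:copy-of select="."/>
    </xsl:if>
  </xsl:template>

  <xsl:template match="text()"/>
</xsl:stylesheet>
```

Unlike the template in the question, this copies the whole item element rather than rebuilding it from title, content, date and author.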
Upvotes: 2
Reputation: 14304
http://www.w3.org/TR/xslt20/#regular-expressions
The regular expressions used by this instruction, and the flags that control the interpretation of these regular expressions, must conform to the syntax defined in [Functions and Operators] (see Section 7.6.1 Regular Expression Syntax), which is itself based on the syntax defined in [XML Schema Part 2].
The first link in the quote shows no sign of \b.
The same goes for the second link (see Single Character Escape).
But if we scroll a bit further in that last document, we find character classes (Category Escape). A combination of the punctuation and space classes, [\p{P}\p{Z}], can be used to achieve a similar effect.
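A minimal sketch of that idea, assuming an XSLT 2.0 processor and the item/title/content structure and $keyword parameter from the question. Instead of spelling out [\p{P}\p{Z}], it uses \W, which in the XML Schema regex dialect is the complement of \w and so covers punctuation, separators and control characters such as line breaks (which \p{Z} alone would miss). The variable name vPattern is made up for illustration, and the keyword is assumed to contain no regex metacharacters.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output omit-xml-declaration="yes" indent="yes"/>

  <!-- Assumed parameter, as in the question; must not contain regex metacharacters -->
  <xsl:param name="keyword" select="'blood pressure'"/>

  <!-- (^|\W) and (\W|$) stand in for the missing \b;
       spaces inside the keyword become \s+ so that line breaks still match -->
  <xsl:variable name="vPattern"
      select="concat('(^|\W)',
                     replace(normalize-space($keyword), ' ', '\\s+'),
                     '(\W|$)')"/>

  <xsl:template match="item">
    <!-- The 'i' flag makes the match case-insensitive -->
    <xsl:if test="matches(title, $vPattern, 'i') or matches(content, $vPattern, 'i')">
      <xsl:copy-of select="."/>
    </xsl:if>
  </xsl:template>

  <xsl:template match="text()"/>
</xsl:stylesheet>
```

With the default keyword, the pattern becomes (^|\W)blood\s+pressure(\W|$), so "coldblood Pressured" is not matched while "Blood Pressure." is.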
Upvotes: 0
Reputation: 167571
XSLT and XPath 2.0 do have a matches() function supporting regular expressions. XSLT and XPath 1.0 don't have such a function; there you would need to use an extension function your XSLT processor supports: http://www.exslt.org/regexp/functions/match/index.html. However, even with XSLT/XPath 2.0, I think the supported regular expression language does not include any "word boundary" pattern.
Upvotes: 0