rossjha

Reputation: 191

XSLT search XML using regex, word boundaries

Is it possible to use regex to search XML content using XSLT? I can search nodes using contains(), but I need word boundaries (e.g. /\bmy phrase\b/i) so that I match a whole phrase and not just individual words.

When searching for 'blood pressure' using the following, all nodes with 'blood', 'pressure' and 'blood pressure' are returned.

I only want nodes containing 'blood pressure' to be returned. With PHP's preg_match, I can achieve this using: /\b$keywords\b/i

<xsl:template match="//item">
    <xsl:choose>
        <xsl:when test="contains(translate(title, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), $keyword) or contains(translate(content, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), $keyword)">
            <item>
                <title><xsl:value-of select="title"/></title>
                <content><xsl:value-of select="content"/></content>
                <date><xsl:value-of select="date"/></date>
                <author><xsl:value-of select="author"/></author>
            </item>
        </xsl:when>
    </xsl:choose>
</xsl:template>

Upvotes: 3

Views: 1160

Answers (3)

Dimitre Novatchev

Reputation: 243459

I. You may do something like this in XSLT 2.0:

<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output omit-xml-declaration="yes" indent="yes"/>

 <xsl:template match="s">
  <xsl:variable name="vWords" select=
  "tokenize(lower-case(string(.)),
            '[\s.?!,;—:\-]+'
            ) [.]
  "/>
  <!-- Emit the element once if any adjacent word pair is "blood", "pressure" -->
  <xsl:sequence select=
   ".[some $i in 1 to count($vWords)
        satisfies
          ($vWords[$i] eq 'blood'
          and
           $vWords[$i+1] eq 'pressure'
          )
     ]
  "/>
 </xsl:template>
 <xsl:template match="text()"/>
</xsl:stylesheet>

When this XSLT 2.0 transformation is applied to the following XML document (no such document provided in the question!!!):

<t>
 <s>He has high blood pressure.</s>
 <s>He has high Blood Pressure.</s>
 <s>He has high Blood
 Pressure.</s>

  <s>He was  coldblood Pressured.</s>

</t>

the wanted, correct result (only elements containing "blood" and "pressure", case-insensitive and as two adjacent words) is produced:

<s>He has high blood pressure.</s>
<s>He has high Blood Pressure.</s>
<s>He has high Blood
 Pressure.</s>

Explanation:

  1. Converting the text to lower case with lower-case(), then using the tokenize() function to split it on runs of whitespace and punctuation characters; the [.] predicate discards empty tokens.

  2. Iterating through the result of tokenize() to find a "blood" word followed immediately by a "pressure" word.
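The two steps above hard-code the words 'blood' and 'pressure'. A minimal sketch of one way to generalize them to an arbitrary phrase follows: rejoin the tokens with single spaces and test for the space-padded phrase. The parameter name pPhrase and the variables vDelims, vPhrase and vText are hypothetical and not part of the original answer.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output omit-xml-declaration="yes" indent="yes"/>

  <!-- Hypothetical parameter: the phrase to search for -->
  <xsl:param name="pPhrase" select="'Blood Pressure'"/>

  <!-- Same delimiter set as the tokenize() call in the answer -->
  <xsl:variable name="vDelims" select="'[\s.?!,;—:\-]+'"/>

  <!-- Phrase normalized to lower-case words joined by single spaces,
       padded with a space on each side -->
  <xsl:variable name="vPhrase" select=
   "concat(' ',
           string-join(tokenize(lower-case($pPhrase), $vDelims)[.], ' '),
           ' ')"/>

  <xsl:template match="s">
    <!-- The element's text, normalized in exactly the same way -->
    <xsl:variable name="vText" select=
     "concat(' ',
             string-join(tokenize(lower-case(string(.)), $vDelims)[.], ' '),
             ' ')"/>
    <!-- A padded match means the phrase occurs as whole, adjacent words -->
    <xsl:if test="contains($vText, $vPhrase)">
      <xsl:sequence select="."/>
    </xsl:if>
  </xsl:template>

  <xsl:template match="text()"/>
</xsl:stylesheet>
```

Because both the text and the phrase pass through the same tokenize() call, a phrase of any number of words is handled without changing the template.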


II. An XSLT 1.0 solution:

<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>
 <xsl:strip-space elements="*"/>

 <xsl:variable name="vUpper" select=
 "'ABCDEFGHIJKLMNOPQRSTUVWXYZ'"/>

 <xsl:variable name="vLower" select=
 "'abcdefghijklmnopqrstuvwxyz'"/>

 <xsl:variable name="vSpaaaceeees" select=
 "'                                                                               '
 "/>

 <xsl:variable name="vAlpha" select="concat($vLower, $vUpper)"/>

 <xsl:template match="s">
   <xsl:variable name="vallLower" select="translate(., $vUpper, $vLower)"/>
     <xsl:copy-of select=
     "self::*
       [contains
        (concat
         (' ',
          normalize-space
           (translate($vallLower, translate($vallLower, $vAlpha, ''), $vSpaaaceeees)),
          ' '
          ),

         ' blood pressure '
         )
       ]
  "/>
 </xsl:template>
 <xsl:template match="text()"/>
</xsl:stylesheet>

When this transformation is applied to the same XML document (above), the same correct result is produced:

<s>He has high blood pressure.</s>
<s>He has high Blood Pressure.</s>
<s>He has high Blood
 Pressure.</s>

Explanation:

  1. Converting to lowercase.

  2. Using the double-translate method to replace any non-alpha character with a space.

  3. Then using normalize-space() to replace any group of adjacent spaces with a single space.

  4. Then surrounding this result with spaces.

  5. Finally, verifying if the current result contains the string " blood pressure ".
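The five steps above could be applied to the item, title and content elements from the question roughly as follows. This is a sketch under assumptions: the keyword parameter is taken to arrive already in lower case with single spaces between words, and the variable names vSpaces and vNeedle are made up for illustration.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output omit-xml-declaration="yes" indent="yes"/>

  <!-- Assumption: lower case, words separated by single spaces -->
  <xsl:param name="keyword" select="'blood pressure'"/>

  <xsl:variable name="vUpper" select="'ABCDEFGHIJKLMNOPQRSTUVWXYZ'"/>
  <xsl:variable name="vLower" select="'abcdefghijklmnopqrstuvwxyz'"/>
  <xsl:variable name="vAlpha" select="concat($vLower, $vUpper)"/>
  <xsl:variable name="vSpaces" select=
   "'                                                                '"/>

  <xsl:template match="item">
    <!-- Step 1: lower-case both searchable fields -->
    <xsl:variable name="vTitle" select="translate(title, $vUpper, $vLower)"/>
    <xsl:variable name="vContent" select="translate(content, $vUpper, $vLower)"/>
    <!-- Step 4, applied to the search phrase as well -->
    <xsl:variable name="vNeedle" select="concat(' ', $keyword, ' ')"/>

    <!-- Steps 2 to 5 for title, then for content -->
    <xsl:if test=
     "contains(concat(' ',
                      normalize-space(translate($vTitle,
                                                translate($vTitle, $vAlpha, ''),
                                                $vSpaces)),
                      ' '),
               $vNeedle)
      or
      contains(concat(' ',
                      normalize-space(translate($vContent,
                                                translate($vContent, $vAlpha, ''),
                                                $vSpaces)),
                      ' '),
               $vNeedle)">
      <xsl:copy-of select="."/>
    </xsl:if>
  </xsl:template>

  <xsl:template match="text()"/>
</xsl:stylesheet>
```

Two caveats follow from how translate() works. Digits count as non-alpha here, so they also become spaces. And a non-alpha character whose first occurrence lies beyond the length of the spaces string is deleted rather than replaced, so long texts may need a longer string of spaces.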

Upvotes: 2

kirilloid

Reputation: 14304

http://www.w3.org/TR/xslt20/#regular-expressions

The regular expressions used by this instruction, and the flags that control the interpretation of these regular expressions, must conform to the syntax defined in [Functions and Operators] (see Section 7.6.1 Regular Expression Syntax), which is itself based on the syntax defined in [XML Schema Part 2].

The first link in the quote shows no sign of \b.

The same goes for the second link (Single Character Escape).

But if we scroll a bit further in the last document, we find character classes (Category Escape). You can combine the punctuation and separator (space) classes, [\p{P}\p{Z}], to achieve a similar effect.
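As a rough illustration of that idea, the sketch below builds a pattern in which the phrase must be preceded and followed by either a string edge or a character from those classes. Several things here are assumptions: the keyword parameter and vPattern variable are hypothetical names, the keyword is taken to contain no regex metacharacters, and \s is added to the class because tabs and line breaks are control characters that \p{Z} does not cover.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output omit-xml-declaration="yes" indent="yes"/>

  <!-- Hypothetical parameter; assumed free of regex metacharacters -->
  <xsl:param name="keyword" select="'blood pressure'"/>

  <!-- (^|boundary) phrase ($|boundary); whitespace inside the phrase
       is relaxed to \s+ so that line breaks between words still match -->
  <xsl:variable name="vPattern" select=
   "concat('(^|[\p{P}\p{Z}\s])',
           replace($keyword, '\s+', '\\s+'),
           '($|[\p{P}\p{Z}\s])')"/>

  <xsl:template match="item">
    <!-- The 'i' flag makes the match case-insensitive -->
    <xsl:if test="matches(title, $vPattern, 'i')
                  or matches(content, $vPattern, 'i')">
      <xsl:copy-of select="."/>
    </xsl:if>
  </xsl:template>

  <xsl:template match="text()"/>
</xsl:stylesheet>
```

Without the 'm' flag, ^ and $ match only at the start and end of the whole string, which is what is wanted here. The sketch also assumes each item has at most one title and one content element.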

Upvotes: 0

Martin Honnen

Reputation: 167571

XSLT and XPath 2.0 do have a matches() function that supports regular expressions. XSLT and XPath 1.0 don't have such a function, so you would need to use an extension function that your XSLT processor supports: http://www.exslt.org/regexp/functions/match/index.html. However, even with XSLT/XPath 2.0, I think the supported regular expression language has no "word boundary" pattern.
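For an XSLT 1.0 processor that implements the EXSLT regular-expressions module, a test along these lines might work. Treat it as a sketch under assumptions: regexp:test() comes from the same EXSLT module as the regexp:match() function linked above, not every processor ships that module, and whether \b is honoured depends on the regex engine behind the implementation. The keyword parameter and vPattern variable are hypothetical, and the keyword is assumed to contain no regex metacharacters.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:regexp="http://exslt.org/regular-expressions"
    exclude-result-prefixes="regexp">
  <xsl:output omit-xml-declaration="yes" indent="yes"/>

  <!-- Hypothetical parameter; assumed free of regex metacharacters -->
  <xsl:param name="keyword" select="'blood pressure'"/>

  <!-- Word-boundary pattern, as in the PHP version from the question -->
  <xsl:variable name="vPattern" select="concat('\b', $keyword, '\b')"/>

  <xsl:template match="item">
    <!-- function-available() guards against processors without the module;
         the right-hand side is not evaluated when the guard is false -->
    <xsl:if test="function-available('regexp:test')
                  and (regexp:test(title, $vPattern, 'i')
                       or regexp:test(content, $vPattern, 'i'))">
      <xsl:copy-of select="."/>
    </xsl:if>
  </xsl:template>

  <xsl:template match="text()"/>
</xsl:stylesheet>
```

On a processor without the module the guard simply produces no output, so it is worth checking function-available('regexp:test') separately before relying on this.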

Upvotes: 0
