Enissay

Reputation: 4953

Regex for tokenize in XQuery

Using XPath, I'm getting text like this:

Sed id felis mi; Nam porta lacinia sapien vestibulum egestas; Praesent nec nisl purus, eget mollis metus. Fusce euismod ante id tellus tincidunt dignissim ornare magna blandit. Nunc id risus quam.

I want to split it into two variables:

var1 = the text from the beginning up to the first dot. If that part contains more than 10 words (separated by blank spaces) and also contains a semicolon ';', var1 should instead be the text from the beginning up to the first semicolon.

var2 = the remaining text to the right of var1.

I started with this code, but it doesn't give me what I want (I haven't handled the 10-word condition yet):

let $left := data(tokenize($doc//div/blockquote/p/text(), '^(.*?)[;|.](.*?)$')[1])
let $right := data(tokenize($doc//div/blockquote/p/text(), '^(.*?)[;|.](.*?)$')[2])

Thanks in advance.

Upvotes: 2

Views: 1211

Answers (2)

Dimitre Novatchev

Reputation: 243529

This can be done without tokenize() or any regex:

   for $s in 'Sed id felis mi; Nam porta lacinia sapien vestibulum egestas; Praesent nec nisl purus, eget mollis metus. Fusce euismod ante id tellus tincidunt dignissim ornare magna blandit. Nunc id risus quam.',
       $vBeforeDot in substring-before($s, '.'),
       $vBeforeSemiC in substring-before($s, ';')
   return
       (: spaces in the normalized text = word count - 1, so "le 9" keeps texts of at most 10 words :)
       ($vBeforeDot
            [string-length(normalize-space(.))
             - string-length(translate(normalize-space(.), ' ', ''))
             le 9
            ],
        (: otherwise fall back to the text before the first semicolon :)
        $vBeforeSemiC
       )[1]
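
The expression above returns only the left part. The question also asks for the right part, so here is a minimal sketch that produces both, assuming the same sample string is bound to $s. The names $beforeDot, $words and $sep are illustrative and do not come from the question; a real query would bind $s from the document instead of a literal.

```xquery
(: Sketch only: $s stands in for the string selected from the document. :)
let $s := 'Sed id felis mi; Nam porta lacinia sapien vestibulum egestas; Praesent nec nisl purus, eget mollis metus. Fusce euismod ante id tellus tincidunt dignissim ornare magna blandit. Nunc id risus quam.'
let $beforeDot := substring-before($s, '.')
(: word count = number of space-separated tokens after whitespace normalization :)
let $words := count(tokenize(normalize-space($beforeDot), ' '))
(: split at the first semicolon only when the part before the dot is long and has one :)
let $sep := if ($words gt 10 and contains($beforeDot, ';')) then ';' else '.'
let $left := substring-before($s, $sep)
let $right := normalize-space(substring-after($s, $sep))
return ($left, $right)
```

For the sample text, the part before the first dot has 17 words and contains a semicolon, so $left is 'Sed id felis mi' and $right begins with 'Nam porta'. If the input contains no dot at all, substring-before() returns an empty string, so that case would need separate handling.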

Upvotes: 4

Cylian

Reputation: 11182

Try this:

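for $p in doc('file:///c:/test.xml')//div/blockquote/p/text()
let $beforeDot := tokenize($p, '[.]')[1]
return
    (: more than 10 words and a semicolon present: cut at the first semicolon :)
    if (count(tokenize(normalize-space($beforeDot), ' ')) gt 10
        and contains($beforeDot, ';')) then
        tokenize($p, ';')[1]
    (: otherwise keep everything up to the first dot :)
    else
        $beforeDot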

For reference see fn:tokenize.
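
A side note on fn:tokenize, since it explains why the attempt in the question fails: the second argument is the separator pattern, not a pattern describing the whole string with capture groups. Inside a character class, '|' is also a literal pipe rather than an alternation. The following is a small sketch with a shortened, made-up sample string, not the asker's data:

```xquery
(: Sketch only: the sample string is shortened for illustration. :)
let $s := 'Sed id felis mi; Nam porta. Nunc id risus quam.'
return (
    (: split on either ';' or '.', keep the first piece :)
    tokenize($s, '[;.]')[1],
    (: also swallow whitespace after the separator, keep the second piece :)
    tokenize($s, '[;.]\s*')[2]
)
```

Here the first item is 'Sed id felis mi' and the second is 'Nam porta'.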

Upvotes: 3
