Wytz

Reputation: 23

Recursive Regex in SPARQL query to identify matching parentheses

I'm trying to use a SPARQL regex to select literals whose parentheses are balanced. So "( (1) ((2)) (((3))) 4)" should be returned, but "( (1) ((2)) (((3)) 4)", where I removed a closing parenthesis after the "3", should not be returned.

I've previously looked here for a suitable regex: Regular expression to match balanced parentheses

I have been trying to use the regex suggested by rogal111, which is as follows:

\(([^()]|(?R))*\)

This regex uses PCRE syntax, which I understand to be the W3C standard that SPARQL should follow. According to the linked example (http://regex101.com/r/lF0fI1/1), it should work for the examples above.

I've tested this on both a Jena-based triple store and a Virtuoso-based triple store.

Jena: when I run the SPARQL query below, it reports that the (?R) inline modifier is unknown.

SELECT ?l
WHERE
{
  BIND("(test)" AS ?l)
  FILTER REGEX(?l, "\\(([^()]|(?R))*\\)").
}

The complete error message that is returned is below.

Regex pattern exception: java.util.regex.PatternSyntaxException: Unknown inline modifier near index 11 \(([^()]|(?R))*\)

Virtuoso: The Virtuoso-based triple store (tested on https://sparql.uniprot.org/sparql) accepts the regex without an error, but returns incorrect results, as the query below shows for a string with an unmatched opening parenthesis:

SELECT ?l
WHERE
{
  BIND("((test)" AS ?l)
  FILTER REGEX(?l, "\\(([^()]|(?R))*\\)").
}

I'm not sure whether this is intentional, a bug, or whether I'm doing something wrong. Ultimately I want to get it to work on the Jena-based triple store. Can anyone help me with this?

Upvotes: 2

Views: 206

Answers (1)

Damyan Ognyanov

Reputation: 786

Just to clarify and augment my comment about the use of REPLACE, the following should work:

SELECT * 
{
    VALUES ?value { 
        "( (1) ((2)) (((3))) 4)" 
        "( (1) ((2)) (((3)) 4)"
        "before (test) after" 
        "before ((test) after"
    }
    # ?result is true when the string left after REPLACE contains no parenthesis
    BIND(!REGEX(
            REPLACE(?value, '(?=\\()(?:(?=.*?\\((?!.*?\\1)(.*\\)(?!.*\\2).*))(?=.*?\\)(?!.*?\\2)(.*)).)+?.*?(?=\\1)[^(]*(?=\\2$)', '')
            , '[()]') AS ?result)
}
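
To illustrate the general idea behind the query (strip text with a replace, then test whether any parenthesis is left over), here is a minimal Python sketch. It does not reproduce the regex above: it swaps in a simpler technique, repeatedly deleting innermost "(...)" pairs until nothing changes. The names INNERMOST_PAIR and is_balanced are made up for this example, and the sketch has not been tried against Jena or Virtuoso.

```python
import re

# Hypothetical helper pattern: one "(...)" group with no parentheses inside it.
INNERMOST_PAIR = re.compile(r"\([^()]*\)")


def is_balanced(text: str) -> bool:
    """Return True if every parenthesis in text has a matching partner."""
    previous = None
    # Keep deleting innermost pairs until a pass changes nothing.
    while previous != text:
        previous = text
        text = INNERMOST_PAIR.sub("", text)
    # Any parenthesis that survives had no partner.
    return re.search(r"[()]", text) is None


if __name__ == "__main__":
    for value in [
        "( (1) ((2)) (((3))) 4)",
        "( (1) ((2)) (((3)) 4)",
        "before (test) after",
        "before ((test) after",
    ]:
        print(repr(value), is_balanced(value))
```

The pattern used here needs no recursion or lookaround, so it should be expressible in regex flavours that lack (?R); the loop does the work that recursion would otherwise do.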

Upvotes: 1
