Egor Skriptunoff
Egor Skriptunoff

Reputation: 23747

Why doesn't a non-greedy quantifier sometimes work in Oracle regex?

IMO, this query should return A=1,B=2,

SELECT regexp_substr('A=1,B=2,C=3,', '.*B=.*?,') as A_and_B FROM dual

But it returns the whole string, A=1,B=2,C=3,, instead. Why?


Update 1:

Oracle 10.2+ is required to use Perl-style metacharacters in regular expressions.

Update 2:

A more clear form of my question (to avoid questions about Oracle version and availability of Perl-style regex extension):

On the same system, why does a non-greedy quantifier sometimes work as expected and sometimes not?

This works correctly:

regexp_substr('A=1,B=2,C=3,', 'B=.*?,')

This doesn't work:

regexp_substr('A=1,B=2,C=3,', '.*B=.*?,')

Fiddle

Update 3:

Yes, it seems to be a bug.

What is the Oracle support reaction on this issue?

Is the bug already known? Does it have an ID?

Upvotes: 29

Views: 7203

Answers (4)

Old Pro
Old Pro

Reputation: 25547

It's a BUG!

You are right that in Perl, 'A=1,B=2,C=3,' =~ /.*B=.*?,/; print $& prints A=1,B=2,

What you have stumbled upon is a bug that still exists in Oracle Database 11g R2. If the exact same regular expression atom (including the quantifier but excluding the greediness modifier) appears twice in a regular expression, both occurrences will have the greediness indicated by the first appearance regardless of the greediness specified by the second one. That this is a bug is clearly demonstrated by these results (here, "the exact same regular expression atom" is [^B]*):

SQL> SELECT regexp_substr('A=1,B=2,C=3,', '[^B]*B=[^Bx]*?,') as good FROM dual;

GOOD
--------
A=1,B=2,

SQL> SELECT regexp_substr('A=1,B=2,C=3,', '[^B]*B=[^B]*?,') as bad FROM dual;

BAD
-----------
A=1,B=2,C=3,

The only difference between the two regular expressions is that the "good" one excludes 'x' as a possible match in the second matching list. Since 'x' does not appear in the target string, excluding it should make no difference, but as you can see, removing the 'x' makes a big difference. That has to be a bug.

Here are some more examples from Oracle 11.2: (SQL Fiddle with even more examples)

SELECT regexp_substr('A=1,B=2,C=3,', '.*B=.*?,')  FROM dual; =>  A=1,B=2,C=3,
SELECT regexp_substr('A=1,B=2,C=3,', '.*B=.*,')   FROM dual; =>  A=1,B=2,C=3,
SELECT regexp_substr('A=1,B=2,C=3,', '.*?B=.*?,') FROM dual; =>  A=1,B=2,
SELECT regexp_substr('A=1,B=2,C=3,', '.*?B=.*,')  FROM dual; =>  A=1,B=2,
-- Changing second operator from * to +
SELECT regexp_substr('A=1,B=2,C=3,', '.*B=.+?,')  FROM dual; =>  A=1,B=2,
SELECT regexp_substr('A=1,B=2,C=3,', '.*B=.+,')   FROM dual; =>  A=1,B=2,C=3,
SELECT regexp_substr('A=1,B=2,C=3,', '.+B=.+,')   FROM dual; =>  A=1,B=2,C=3,
SELECT regexp_substr('A=1,B=2,C=3,', '.+?B=.+,')  FROM dual; =>  A=1,B=2,

The pattern is consistent: the greediness of the first occurrence is used for the second occurrence whether it should be or not.

Upvotes: 24

exussum
exussum

Reputation: 18560

Because you're selecting too much:

SELECT
  regexp_substr(
    'A=1,B=2,C=3,',
    '.*?B=.*?,'
  ) as A_and_B,  -- Now works as expected
  regexp_substr(
    'A=1,B=2,C=3,',
    'B=.*?,'
  ) as B_only    -- works just fine
FROM dual

SQL Fiddle: http://www.sqlfiddle.com/#!4/d41d8/11450

Upvotes: -1

Andrew Wolfe
Andrew Wolfe

Reputation: 2096

You've got a really great bounty, so I'm going to try to nail it comprehensively.

You make assumptions in your regular expression handling that are incorrect.

  1. Oracle is NOT compatible with Perl regular expressions, it is compatible with POSIX. It describes its support for Perl as "Perl-Influenced"
  2. There is an intrinsic syntax conflict around the use of the Perl "*?" in Oracle, if you read that reference the way I do, and Oracle legitimately chooses the POSIX usage
  3. Your description of how perl handles "*?" is not quite right.

Here is a mashup of the options we've discussed. The key to this issue is around case 30

    CASE    SRC                             TEXT               RE                FROM_WHOM                                          RESULT        
    ------- ------------------------------- ------------------ ----------------- -------------------------------------------------- --------------
          1 Egor's original source string   A=1,B=2,C=3,       .*B=.*?,          Egor's original pattern "doesn't work"             A=1,B=2,C=3,  
          2 Egor's original source string   A=1,B=2,C=3,       .*B=.?,           Egor's "works correctly"                           A=1,B=2,      
          3 Egor's original source string   A=1,B=2,C=3,       .*B=.+?,          Old Pro comment 1 form 2                           A=1,B=2,      
          4 Egor's original source string   A=1,B=2,C=3,       .+B=.*?,          Old Pro comment 1 form 1                           A=1,B=2,      
          5 Egor's original source string   A=1,B=2,C=3,       .*B=.{0,}?,       Old Pro comment 2                                  A=1,B=2,      
          6 Egor's original source string   A=1,B=2,C=3,       [^B]*B=[^Bx]*?,   Old Pro answer form 1 "good"                       A=1,B=2,      
          7 Egor's original source string   A=1,B=2,C=3,       [^B]*B=[^B]*?,    Old Pro answer form 2 "bad"                        A=1,B=2,C=3,  
          8 Egor's original source string   A=1,B=2,C=3,       (.)*B=(.)*?,      TBone answer form 1                                A=1,B=2,      
          9 TBone answer example 2          1_@_2_a_3_@_4_a    (\w)*?@(\w)*      TBone answer example 2 form 1                      1_@_2_a_3_    
         10 TBone answer example 2          1_@_2_a_3_@_4_a    (\w)*@(\w)*?      TBone answer example 2 form 2                      1_@           
         30 Egor's original source string   A=1,B=2,C=3,       .*B=(.)*?,        Schemaczar Variant to force Perl operation         A=1,B=2,      
         31 Egor's original source string   A=1,B=2,C=3,       .*B=(.*)?,        Schemaczar Variant of Egor to force POSIX          A=1,B=2,C=3,  
         32 Egor's original source string   A=1,B=2,C=3,       .*B=.*{0,1}       Schemaczar Applying Egor's  'non-greedy'           A=1,B=2,C=3,  
         33 Egor's original source string   A=1,B=2,C=3,       .*B=(.)*{0,1}     Schemaczar Another variant of Egor's "non-greedy"  A=1,B=2,C=3,  

I am pretty sure that CASE 30 is what you thought you were writing - that is, you thought the "*?" had a stronger association than the "*" by itself. True for Perl, I guess, but for Oracle (and presumably canonical POSIX) RE's, the "*?" has a lower precedence and associativity than "*". So Oracle reads it as "(.*)?" (case 31) whereas Perl reads it as "(.)*?", that is, case 30.

Note cases 32 and 33 indicate that "*{0,1}" does not work like "*?".

Note that Oracle REGEXP does not work like LIKE, that is, it does not require the match pattern to cover the entire test string. Using the "^" begin and "$" end markers might help you with this as well.

My script:

SET SERVEROUTPUT ON

<<DISCREET_DROP>> begin
  DBMS_OUTPUT.ENABLE;
  for dropit in (select 'DROP TABLE ' || TABLE_NAME || ' CASCADE CONSTRAINTS' AS SYNT
  FROM TABS WHERE TABLE_NAME IN ('TEST_PATS', 'TEST_STRINGS')
  )
  LOOP
    DBMS_OUTPUT.PUT_LINE('Dropping via ' || dropit.synt);
    execute immediate dropit.synt;
  END LOOP;
END DISCREET_DROP;
/

--------------------------------------------------------
--  DDL for Table TEST_PATS
--------------------------------------------------------

  CREATE TABLE TEST_PATS 
   (    RE VARCHAR2(2000), 
  FROM_WHOM VARCHAR2(50), 
  PAT_GROUP VARCHAR2(50), 
  PAT_ORDER NUMBER(9,0)
   ) ;
/
--------------------------------------------------------
--  DDL for Table TEST_STRINGS
--------------------------------------------------------

  CREATE TABLE TEST_STRINGS 
   (    TEXT VARCHAR2(2000), 
  SRC VARCHAR2(200), 
  TEXT_GROUP VARCHAR2(50), 
  TEXT_ORDER NUMBER(9,0)
   ) ;
/
--------------------------------------------------------
--  DDL for View REGEXP_TESTER_V
--------------------------------------------------------

  CREATE OR REPLACE FORCE VIEW REGEXP_TESTER_V (CASE_NUMBER, SRC, TEXT, RE, FROM_WHOM, RESULT) AS 
  select pat_order as case_number,
  src, text, re, from_whom, 
  regexp_substr (text, re) as result
from test_pats full outer join test_strings on (text_group = pat_group)
order by pat_order, text_order;
/
REM INSERTING into TEST_PATS
SET DEFINE OFF;
Insert into TEST_PATS (RE,FROM_WHOM,PAT_GROUP,PAT_ORDER) values ('.*B=.*?,','Egor''s original pattern "doesn''t work"','Egor',1);
Insert into TEST_PATS (RE,FROM_WHOM,PAT_GROUP,PAT_ORDER) values ('.*B=.?,','Egor''s "works correctly"','Egor',2);
Insert into TEST_PATS (RE,FROM_WHOM,PAT_GROUP,PAT_ORDER) values ('.*B=(.)*?,','Schemaczar Variant to force Perl operation','Egor',30);
Insert into TEST_PATS (RE,FROM_WHOM,PAT_GROUP,PAT_ORDER) values ('.*B=(.*)?,','Schemaczar Variant of Egor to force POSIX','Egor',31);
Insert into TEST_PATS (RE,FROM_WHOM,PAT_GROUP,PAT_ORDER) values ('.*B=.*{0,1}','Schemaczar Applying Egor''s  ''non-greedy''','Egor',32);
Insert into TEST_PATS (RE,FROM_WHOM,PAT_GROUP,PAT_ORDER) values ('.*B=(.)*{0,1}','Schemaczar Another variant of Egor''s "non-greedy"','Egor',33);
Insert into TEST_PATS (RE,FROM_WHOM,PAT_GROUP,PAT_ORDER) values ('[^B]*B=[^Bx]*?,','Old Pro answer form 1 "good"','Egor',6);
Insert into TEST_PATS (RE,FROM_WHOM,PAT_GROUP,PAT_ORDER) values ('[^B]*B=[^B]*?,','Old Pro answer form 2 "bad"','Egor',7);
Insert into TEST_PATS (RE,FROM_WHOM,PAT_GROUP,PAT_ORDER) values ('.*B=.+?,','Old Pro comment 1 form 2','Egor',3);
Insert into TEST_PATS (RE,FROM_WHOM,PAT_GROUP,PAT_ORDER) values ('.*B=.{0,}?,','Old Pro comment 2','Egor',5);
Insert into TEST_PATS (RE,FROM_WHOM,PAT_GROUP,PAT_ORDER) values ('.+B=.*?,','Old Pro comment 1 form 1','Egor',4);
Insert into TEST_PATS (RE,FROM_WHOM,PAT_GROUP,PAT_ORDER) values ('(.)*B=(.)*?,','TBone answer form 1','Egor',8);
Insert into TEST_PATS (RE,FROM_WHOM,PAT_GROUP,PAT_ORDER) values ('(\w)*?@(\w)*','TBone answer example 2 form 1','TBone',9);
Insert into TEST_PATS (RE,FROM_WHOM,PAT_GROUP,PAT_ORDER) values ('(\w)*@(\w)*?','TBone answer example 2 form 2','TBone',10);
REM INSERTING into TEST_STRINGS
SET DEFINE OFF;
Insert into TEST_STRINGS (TEXT,SRC,TEXT_GROUP,TEXT_ORDER) values ('A=1,B=2,C=3,','Egor''s original source string','Egor',1);
Insert into TEST_STRINGS (TEXT,SRC,TEXT_GROUP,TEXT_ORDER) values ('1_@_2_a_3_@_4_a','TBone answer example 2','TBone',2);

COLUMN SRC FORMAT A50 WORD_WRAP
COLUMN TEXT  FORMAT A50 WORD_WRAP
COLUMN RE FORMAT A50 WORD_WRAP
COLUMN FROM_WHOM FORMAT A50 WORD_WRAP
COLUMN RESULT  FORMAT A50 WORD_WRAP

SELECT * FROM REGEXP_TESTER_V;

Upvotes: 0

tbone
tbone

Reputation: 15473

Looking at the feedback, I hesitate to jump in, but here I go ;-)

According to the Oracle docs, the *? and +? match a "preceding subexpression". For *? specifically:

Matches zero or more occurrences of the preceding subexpression (nongreedyFootref 1). Matches the empty string whenever possible.

To create a subexpression group, use parenthesis ():

Treats the expression within the parentheses as a unit. The expression can be a string or a complex expression containing operators.

You can refer to a subexpression in a back reference.

This will allow you to use greedy and non-greedy (many alternating times actually) in the same regexp, with expected results. For your example:

select regexp_substr('A=1,B=2,C=3,', '(.)*B=(.)*?,') from dual;

To make the point a bit more clear (i hope), this example uses greedy and non-greedy in the same regexp_substr, with different (correct) results depending on where the ? is placed (it does NOT just use the rule for the first subexpression it sees). Also note that the subexpression (\w) will match alphanumerics and underscore only, not @.

-- non-greedy followed by greedy 
select regexp_substr('1_@_2_a_3_@_4_a', '(\w)*?@(\w)*') from dual;

result: 1_@_2_a_3_

-- greedy followed by non-greedy
select regexp_substr('1_@_2_a_3_@_4_a', '(\w)*@(\w)*?') from dual;

result: 1_@

Upvotes: 7

Related Questions