.netoracle-databaseoracle11goracle10goracle-sqldeveloper

Reputation: 104

Oracle Regex Substr to ignore optional group

I am working on regex_substr to get the following result in oracle

Test 1.Input: JOHN 10BC STUDENT Desired Output: JOHN

Test 2.Input: JOHN  STUDENT     Desired Output: JOHN

Test 3.Input: JOHN 10BC STU     Desired Output: JOHN

Test 4.Input: JOHN 10BC TEACHER Desired Output:NULL

Test 5.Input: JOHN  TEACHER     Desired Output:NULL

Test 6. Input: MR JOHN 08BC STU Desired Output: MR JOHN

Test 7. Input: MR JOHN STUDENT Desired Output: MR JOHN

Test 8. Input: MR JOHN 07BC TEACHER Desired Output: Null

Test 9. Input: MR STUART 06BC STDUENT Desired Output: MR STUART 

Test 10. Input: MR STUART LEE 05BC STDUENT Desired Output: MR STUART LEE

Query:

Test 1:

select REGEXP_SUBSTR('JOHN 10BC STUDENT','(.*)(\s+.*BC)?\sSTU(DENT)?',1,1,'i',1) from dual;

Output: Failed . Returned JOHN 10BC instead of JOHN

Test 2:

select REGEXP_SUBSTR('JOHN STUDENT','(.*)(\s+.*BC)?\sSTU(DENT)?',1,1,'i',1) from dual;

Output: Passed. Returned JOHN

I modified the query as below by removing the ? in the second block

Test 1:select REGEXP_SUBSTR('JOHN 10BC STUDENT','(.*)(\s+.*BC)\sSTU(DENT)?',1,1,'i',1) from dual; Output: Passed. Returned JOHN

Test 2:select REGEXP_SUBSTR('JOHN STUDENT','(.*)(\s+.*BC)\sSTU(DENT)?',1,1,'i',1) from dual; Output: Failed. Returned Null instead of JOHN.

How to ignore the middle optional word BC and always return the words till BC as a substring for STUDENT using REGEXP_SUBSTR. Thanks in Advance!!!

Upvotes: 0

Answers (3)

astentx

Reputation: 6751

Use non-greedy pattern for any-character match (.*) by adding a question mark after it (.*?) to prevent it from grabbing an optional part ((\s+.*BC)?):

with test (id, col) as (
  select 1, 'JOHN 10BC STUDENT'  from dual union all
  select 2, 'JOHN  STUDENT'      from dual union all
  select 3, 'JOHN 10BC STU'      from dual union all
  select 4, 'JOHN 10BC TEACHER'  from dual union all
  select 4, 'JSTUOHN 10BC TEACHER'  from dual union all
  select 5, 'JOHN  TEACHER'      from dual
)
select
  id,
  col,
  regexp_substr(col, '(.*?)(\s+.*BC)?\sSTU(DENT)?',1,1,'i',1) as qwe
from test

ID | COL                  | QWE  
-: | :------------------- | :----
 1 | JOHN 10BC STUDENT    | JOHN 
 2 | JOHN  STUDENT        | JOHN 
 3 | JOHN 10BC STU        | JOHN 
 4 | JOHN 10BC TEACHER    | null 
 4 | JSTUOHN 10BC TEACHER | null 
 5 | JOHN  TEACHER        | null

db<>fiddle here

UPD. To allow multiple words as the first part, you may use the following regex:

(.+?)(\s+\S*BC)?\s+STU(DENT)?$

Explanation:

(.+?) - match any character one or more times with the shortest match (allowing following optional groups to be matched).
(\s+\S*BC)? - optional match for the group of zero or more non-whitespace (\S) characters followed by BC. The group should be separated from the previous non-whitespace character by one or more whitespace characters. \S* instead of .* will prevent from inclusion of multiple words into this group.
\s+STU(DENT)?$ - the last part should be a word STUDENT or it's abbreviation STU. It should be the last word in the line ($ right after it).

And the result of extended test case is:

with test (id, col) as (
  select 1, 'JOHN 10BC STUDENT'  from dual union all
  select 2, 'JOHN  STUDENT'      from dual union all
  select 3, 'JOHN 10BC STU'      from dual union all
  select 4, 'JOHN 10BC TEACHER'  from dual union all
  select 4, 'JSTUOHN 10BC TEACHER'  from dual union all
  select 5, 'JOHN  TEACHER'      from dual union all
  select 6, 'MR JOHN 10BC STU' from dual union all
  select 7, 'MR STUART ABC 10BC STUDENT' from dual union all
  select 8, 'MR STUART ABC 10BC STUDENTS' from dual union all
  select 9, 'MR STUART STU 10BC TEACHER' from dual
)
select
  id,
  col,
  regexp_substr(col, '(.*?)(\s+\S*BC)?\s+STU(DENT)?$',1,1,'i',1) as qwe
from test

ID | COL                         | QWE          
-: | :-------------------------- | :------------
 1 | JOHN 10BC STUDENT           | JOHN         
 2 | JOHN  STUDENT               | JOHN         
 3 | JOHN 10BC STU               | JOHN         
 4 | JOHN 10BC TEACHER           | null         
 4 | JSTUOHN 10BC TEACHER        | null         
 5 | JOHN  TEACHER               | null         
 6 | MR JOHN 10BC STU            | MR JOHN      
 7 | MR STUART ABC 10BC STUDENT  | MR STUART ABC
 8 | MR STUART ABC 10BC STUDENTS | null         
 9 | MR STUART STU 10BC TEACHER  | null

db<>fiddle here

Upvotes: 1

MT0

Reputation: 168106

How to ignore the middle optional word 10BC and always return the first word as a substring for STUDENT?

You do not need to use (slow) regular expressions and can use (much faster) simple string functions:

WITH test (id, col) as (
  select 1, 'JOHN 10BC STUDENT'    from dual union all
  select 2, 'JOHN  STUDENT'        from dual union all
  select 3, 'JOHN 10BC STU'        from dual union all
  select 4, 'JOHN 10BC TEACHER'    from dual union all
  select 5, 'JSTUOHN 10BC TEACHER' from dual union all
  select 6, 'JOHN  TEACHER'        from dual
)
SELECT id,
       col,
       CASE
       WHEN col LIKE '% STUDENT'
       OR   col LIKE '% STU'
       THEN SUBSTR(col, 1, INSTR(col, ' ')  - 1)
       END AS first_word
FROM   test

(using astentx's data)

The output is:

ID COL FIRST_WORD

1 JOHN 10BC STUDENT JOHN

2 JOHN STUDENT JOHN

3 JOHN 10BC STU JOHN

4 JOHN 10BC TEACHER null

5 JSTUOHN 10BC TEACHER null

6 JOHN TEACHER null

ID	COL	FIRST_WORD
1	JOHN 10BC STUDENT	JOHN
2	JOHN STUDENT	JOHN
3	JOHN 10BC STU	JOHN
4	JOHN 10BC TEACHER	null
5	JSTUOHN 10BC TEACHER	null
6	JOHN TEACHER	null

db<>fiddle here

Upvotes: 0

Littlefoot

Reputation: 142798

Do you really need to check what's in the middle? Perhaps you could ignore that and work with the rest of the requirement (check for STU or STUDENT (as your sample data suggest) and fetch data according to that)?

Here are two options (result_1 and result_2), see if any of these help.

Sample data:

SQL> with test (id, col) as
  2    (select 1, 'JOHN 10BC STUDENT'  from dual union all
  3     select 2, 'JOHN  STUDENT'      from dual union all
  4     select 3, 'JOHN 10BC STU'      from dual union all
  5     select 4, 'JOHN 10BC TEACHER'  from dual union all
  6     select 5, 'JOHN  TEACHER'      from dual
  7    )

Query begins here:

  8  select id,
  9    col,
 10    case when instr(col, 'STU') > 0 then substr(col, 1, instr(col, ' ') - 1)
 11         else null
 12    end result_1,
 13    --
 14    case when regexp_substr(col, '\w+$') in ('STU', 'STUDENT') then
 15              regexp_substr(col, '^\w+')
 16         else null
 17    end result_2
 18  from test;

        ID COL               RESULT_1             RESULT_2
---------- ----------------- -------------------- --------------------
         1 JOHN 10BC STUDENT JOHN                 JOHN
         2 JOHN  STUDENT     JOHN                 JOHN
         3 JOHN 10BC STU     JOHN                 JOHN
         4 JOHN 10BC TEACHER
         5 JOHN  TEACHER

SQL>

result_1 searches col for the STU string; if it is found, it returns the first word (from beginning of col up to the first space character)
result_2 checks whether the last word (but this time using regexp_substr, anchoring to the end $ of col) is "STU" or "STUDENT" and returns the first word (anchored to the beginning ^ of col

Upvotes: 0

Oracle Regex Substr to ignore optional group

Answers (3)

Related Questions