jason Palmer
jason Palmer

Reputation: 43

Extract data using REGEXP_SUBSTR

Hi there I'm trying to extract from an oracle database part of the text contained within a single field using the REGEXP_SUBSTR function. The text in question is shown in bold text below "BRS14774366". The good news is the pattern of the data i'm trying to extract is fairly consistent in that it always begins with "-" and ends with "CSN", however the text between which I'm trying to consistently extract is not always the same and can consist of both alpha and numeric characters and be between 1-12 characters in length.

PSN932-52506252-BRS14774366CSN/SF-1/25JAN0524

Below are further examples showing the slight differences in length, again the text i'm trying to extract is shown in bold. As you can see the position is always the same but the Alpha numeric characters can be anywhere between "-" and "CSN" with differing lengths.

PSN932-49837056-DELAIR09364CSN/SF-66/25JAN0541

PSN932-51231434-H1001865CSN/SF-5/25JAN0546

PSN932-52648256-2EGA814CSN/SF-10/25JAN0549

Taking the first sample data (PSN932-52506252-BRS14774366CSN/SF-1/25JAN0524) I created the below query that outputs the data correctly, however this query doesn't consider that the text can consist of both Alpha/numeric characters with a varying length between 1-12 characters

SELECT REGEXP_SUBSTR('PSN932-52506252-BRS14774366CSN/SF-1/25JAN0524', '-(\D\D\D\d\d\d\d\d\d\d\d)',1 ,1, 'i',1) "REGEXP_SUBSTR" FROM DUAL;

output of above query is as follows:

BRS14774366

Can anyone tell me how to format the match pattern in the query so I can consistently extract the data between "-" and "CSN"?

As always appreciate any assistance people can provide?

Update - it seems the data is stored containing carriage returns so the below query doesn't work:

SELECT REGEXP_SUBSTR('PSN 932-52506252-BRS14774366 CSN/SF-1/25JAN0524', '-(\w+)CSN', 1, 1, 'i', 1) "REGEXP_SUBSTR" FROM DUAL;

works fine if the data was like this:

SELECT

REGEXP_SUBSTR('PSN932-52506252-BRS14774366CSN/SF-1/25JAN0524', '-(\w+)CSN', 1, 1, 'i', 1) "REGEXP_SUBSTR" FROM DUAL;

Can this function deal with carriage returns?

Upvotes: 0

Views: 888

Answers (5)

user14999956
user14999956

Reputation: 85

Even though the query looks ugly, just posting an alternative to regexp related. But the text needs to be duplicated to get the desired result.

SELECT
    substr('shusduhash-basb'
           || CHR(10)
           || 'daks-jsbabsCSN/', instr('shusduhash-basb'
                                       || CHR(10)
                                       || 'daks-jsbabsCSN/',
                                       '-',
                                       1,
                                       2) + 1,
           instr('shusduhash-basb'
                 || CHR(10)
                 || 'daks-jsbabsCSN/', 'CSN/', 1, 1) - instr('shusduhash-basb'
                                                             || CHR(10)
                                                             || 'daks-jsbabsCSN/', '-', 1, 2) - 1) x
FROM
    dual;

Upvotes: 0

Patrick Bacon
Patrick Bacon

Reputation: 4660

Use the alternation operator to account for the carriage return:

  SELECT
    regexp_substr('PSN932-52506252-BRS1477'
    || CHR(13)
    || '4366CSN/SF-1/25JAN0524','(([[:alnum:]]|['
    || CHR(13)
    || '])+)CSN') "REGEXP_SUBSTR"
FROM
    dual;

Upvotes: 0

Gasparen
Gasparen

Reputation: 81

You can use \w to match any alphanumeric character

Oracle docs

\w A word character, which is defined as an alphanumeric or underscore () character. It is equivalent to the POSIX class [[:alnum:]]. Note that if you do not want to include the underscore character, you can use the POSIX class [[:alnum:]].

So the pattern should be changed to -(\w+)CSN

Removing the newlines is probably done easiest by replacing the linefeed/carrige return characters.

WITH s AS (select 'SN932-52506252-BRS14774366CSN/SF-1/25JAN0524' n from dual union all
       select 'PSN932-49837056-DELAIR09364' || chr(10) || 'CSN/SF-66/25JAN0541' from dual union all
       select 'PSN932-51231434-H1001865CSN/SF-5/25JAN0546' from dual union all
       select 'PSN932-52648256-2EGA814' || chr(13) || 'CSN/SF-10/25JAN0549' from dual),
remove_newlines as (select replace(replace(s.n, chr(10), ''), chr(13), '') n from s)

SELECT regexp_substr(s.n, '-(\w+)CSN', 1, 1, 'i', 1) "REGEXP_SUBSTR" FROM remove_newlines s;

Upvotes: 1

Aleksej
Aleksej

Reputation: 22969

You can avoid regular expressions for this, by combining instr and substr.

This may be not so easy to read, but usually performs better then regexp solutions.

 with test(x) as (
        select 'PSN932-49837056-DELAIR09364CSN/SF-66/25JAN0541' from dual union all
        select 'PSN932-51231434-H1001865CSN/SF-5/25JAN0546' from dual union all
        select 'PSN932-52648256-2EGA814CSN/SF-10/25JAN0549' from dual
    )
    select substr(
                    substr(x, 1, instr(x, 'CSN') -1),
                    instr(
                            substr(x, 1, instr(x, 'CSN') -1),
                            '-',
                            -1
                         )+1
                 )
    from test

This gets the part up to CSN :

substr(substr(x, 1, instr(x, 'CSN') -1)

and then gets the substring of this part starting from the last '-' :

instr(substr(x, 1, instr(x, 'CSN') -1), '-',-1)+1

Upvotes: 0

Sanders the Softwarer
Sanders the Softwarer

Reputation: 2496

Is it what you're looking for?

SQL> with
  2    s as (select 'SN932-52506252-BRS14774366CSN/SF-1/25JAN0524' n from dual union all
  3          select 'PSN932-49837056-DELAIR09364CSN/SF-66/25JAN0541' from dual union all
  4          select 'PSN932-51231434-H1001865CSN/SF-5/25JAN0546' from dual union all
  5          select 'PSN932-52648256-2EGA814CSN/SF-10/25JAN0549' from dual)
  6  select
  7    substr(replace(regexp_substr(s.n, '-([[:alpha:]]|[[:digit:]])+CSN'), 'CSN'), 2)
  8  from s;

SUBSTR(REPLACE(REGEXP_SUBSTR(S
--------------------------------------------------------------------------------
BRS14774366
DELAIR09364
H1001865
2EGA814

Upvotes: 1

Related Questions