Gerry Gry
Gerry Gry

Reputation: 182

Oracle SQL: Extracting multiple text between two characters

i have table like below :

|-------------|---------------------------------------------------|
|ID.          | CONTENT                                           |
|-------------|---------------------------------------------------|
|1            |<TITLE> <SUB-TITLE-1> Content <SUB-TITLE-2>Content.
|2            |<TITLE> <SUB-TITLE-1> Content <SUB-TITLE-2>Content.
|3            |<TITLE> <SUB-TITLE-1> Content <SUB-TITLE-2>Content. <SUB-TITLE-3> Content
|-------------|---------------------------------------------------|

I want to extract all text in between <>, so it will become like below :

|-------------|-------------------------------------------------|
|ID.          | CONTENT                                         |
|-------------|-------------------------------------------------|
|1            |TITLE                                            |
|1            |SUB-TITLE-1                                      |
|1            |SUB-TITLE-2                                      |
|2            |TITLE                                            |
|2            |SUB-TITLE-1                                      |
|2            |SUB-TITLE-2                                      |
|3            |TITLE                                            |
|3            |SUB-TITLE-1                                      |
|3            |SUB-TITLE-2                                      | 
|3            |SUB-TITLE-3                                      |
|-------------|-------------------------------------------------|

How to achieve this ? I'm trying to do by regex, but I think I'm lost.. My Oracle version is 18c, if that's help...

Upvotes: 0

Views: 468

Answers (4)

Josh Eller
Josh Eller

Reputation: 2065

You can use the 4th argument of REGEXP_SUBSTR to specify an occurrence for matching.

To get a row for the 1st, 2nd, and 3rd occurrence, you can cross-join with a sub-query from dual.

WITH test_data AS (
    SELECT 1 AS content_id, '<TITLE> <SUB-TITLE-1> Content<SUB-TITLE-2>Content.<A third sub-title>' AS content_data FROM dual UNION
    SELECT 2 AS content_id, '<TITLE> <SUB-TITLE-1> Content<SUB-TITLE-2>Content.' AS content_data FROM dual
)

SELECT t.content_id,
       REGEXP_SUBSTR(t.content_data, '<(.*?)>', 1, s.match_occurrence, 'i', 1) AS content_match
FROM test_data t
CROSS JOIN (
    SELECT 1 AS match_occurrence FROM dual UNION
    SELECT 2 AS match_occurrence FROM dual UNION
    SELECT 3 AS match_occurrence FROM dual UNION
    SELECT 4 AS match_occurrence FROM dual
    /* ... etc, with the number of rows equal to the maximum number of matches that can appear */
) s
WHERE REGEXP_SUBSTR(t.content_data, '<.*?>', 1, s.match_occurrence) IS NOT NULL /* Only return records that have a match for the given occurrence */
ORDER BY t.content_id, s.match_occurrence

Borrowing the CONNECT_BY_LEVEL from Barbaros' excellent answer, you could do it more concisely as:

WITH test_data AS (
    SELECT 1 AS content_id, '<TITLE> <SUB-TITLE-1> Content<SUB-TITLE-2>Content.<A third sub-title>' AS content_data FROM dual UNION
    SELECT 2 AS content_id, '<TITLE> <SUB-TITLE-1> Content<SUB-TITLE-2>Content.' AS content_data FROM dual
)

SELECT t.content_id,
       REGEXP_SUBSTR(t.content_data, '<(.*?)>', 1, LEVEL, 'i', 1) AS content_match
FROM test_data t
CONNECT BY 
   LEVEL <= REGEXP_COUNT(t.content_data, '<.*?>')
   AND PRIOR sys_guid() IS NOT NULL
   AND PRIOR content_id = content_id
ORDER BY t.content_id, LEVEL

Note that the CONNECT_BY_LEVEL method might be slower on large datasets, so I would avoid that if performance is a concern.

Upvotes: 3

marcothesane
marcothesane

Reputation: 6721

Perfect approach, @Josh Eller !

Only that @Gerry Gry needs it without the greater/smaller signs.

Try with grouping using parentheses?

WITH test_data(content_id,content_data) AS (
      SELECT 1 , '<TITLE> <SUB-TITLE-1> Content<SUB-TITLE-2>Content.<SUB-TITLE-3>Content.' FROM dual
UNION SELECT 2 , '<TITLE> <SUB-TITLE-1> Content<SUB-TITLE-2>Content.' FROM dual
)

SELECT t.content_id
     , match_occurrence
     , REGEXP_SUBSTR(
         t.content_data     -- input string
       , '[<]([^>]*)[>]'    -- regex
       , 1                  -- starting position
       , s.match_occurrence -- n-th occurrence
       , ''                 -- regexp modifier
       , 1                  -- captured-subexp
       ) AS content_match
FROM test_data t
CROSS JOIN (
      SELECT 1 FROM dual 
UNION SELECT 2 FROM dual 
UNION SELECT 3 FROM dual 
UNION SELECT 4 FROM dual
) s(match_occurrence)
WHERE 
  REGEXP_SUBSTR(
         t.content_data     -- input string
       , '[<]([^>]*)[>]'    -- regex
       , 1                  -- starting position
       , s.match_occurrence -- n-th occurrence
       , ''                 -- regexp modifier
       , 1                  -- captured-subexp
  )  
  IS NOT NULL
ORDER BY t.content_id, s.match_occurrence
;
-- out Time: First fetch (0 rows): 0.656 ms. All rows formatted: 0.667 ms
-- out  content_id | match_occurrence | content_match 
-- out ------------+------------------+---------------
-- out           1 |                1 | TITLE
-- out           1 |                2 | SUB-TITLE-1
-- out           1 |                3 | SUB-TITLE-2
-- out           1 |                4 | SUB-TITLE-3
-- out           2 |                1 | TITLE
-- out           2 |                2 | SUB-TITLE-1
-- out           2 |                3 | SUB-TITLE-2
-- out (7 rows)
-- out 
-- out Time: First fetch (7 rows): 29.904 ms. All rows formatted: 29.947 ms

Upvotes: 0

Barbaros &#214;zhan
Barbaros &#214;zhan

Reputation: 65208

One option would be using instr() and substr() functions together within a

SELECT .. FROM ..CONNECT BY level style query in order to repeat through counting the numbers of > (or <) signs within each strings :

SELECT id, substr(content,
                  instr(content,'<',1,level)+1,
                  instr(content,'>',1,level)-instr(content,'<',1,level)-1) as content
  FROM tab
CONNECT BY level <= regexp_count(content,'>')
   AND PRIOR sys_guid() IS NOT NULL
   AND PRIOR id = id

Demo

Upvotes: 3

Himanshu
Himanshu

Reputation: 3970

I had tried the more conventional way using SUBSTR and INSTR

     With data as 
      Select column1, 
      Trim(CONTENT, '<', '>') as col2 FROM
         TABLE
    WITH subdata as
   ( Select column1, 
     SUBSTR(col2,0, INSTR(col2, ' ')) 
    as s1
     from 
     data) t1
     Union
     ( Select t1.column1 as col1,
     SUBSTR(col2, Length(t1.s1)+1
           INSTR(
           SUBSTR(
            t1.col2, Length(t1.s1)+1,  
           LENGTH(col2)), ' ')))  as col2
      From
     data) t3
      Union
       ........ t3.... t4

      From table

Upvotes: 0

Related Questions