Reputation: 182
i have table like below :
|-------------|---------------------------------------------------|
|ID. | CONTENT |
|-------------|---------------------------------------------------|
|1 |<TITLE> <SUB-TITLE-1> Content <SUB-TITLE-2>Content.
|2 |<TITLE> <SUB-TITLE-1> Content <SUB-TITLE-2>Content.
|3 |<TITLE> <SUB-TITLE-1> Content <SUB-TITLE-2>Content. <SUB-TITLE-3> Content
|-------------|---------------------------------------------------|
I want to extract all text in between <>, so it will become like below :
|-------------|-------------------------------------------------|
|ID. | CONTENT |
|-------------|-------------------------------------------------|
|1 |TITLE |
|1 |SUB-TITLE-1 |
|1 |SUB-TITLE-2 |
|2 |TITLE |
|2 |SUB-TITLE-1 |
|2 |SUB-TITLE-2 |
|3 |TITLE |
|3 |SUB-TITLE-1 |
|3 |SUB-TITLE-2 |
|3 |SUB-TITLE-3 |
|-------------|-------------------------------------------------|
How to achieve this ? I'm trying to do by regex, but I think I'm lost.. My Oracle version is 18c, if that's help...
Upvotes: 0
Views: 468
Reputation: 2065
You can use the 4th argument of REGEXP_SUBSTR
to specify an occurrence for matching.
To get a row for the 1st, 2nd, and 3rd occurrence, you can cross-join with a sub-query from dual.
WITH test_data AS (
SELECT 1 AS content_id, '<TITLE> <SUB-TITLE-1> Content<SUB-TITLE-2>Content.<A third sub-title>' AS content_data FROM dual UNION
SELECT 2 AS content_id, '<TITLE> <SUB-TITLE-1> Content<SUB-TITLE-2>Content.' AS content_data FROM dual
)
SELECT t.content_id,
REGEXP_SUBSTR(t.content_data, '<(.*?)>', 1, s.match_occurrence, 'i', 1) AS content_match
FROM test_data t
CROSS JOIN (
SELECT 1 AS match_occurrence FROM dual UNION
SELECT 2 AS match_occurrence FROM dual UNION
SELECT 3 AS match_occurrence FROM dual UNION
SELECT 4 AS match_occurrence FROM dual
/* ... etc, with the number of rows equal to the maximum number of matches that can appear */
) s
WHERE REGEXP_SUBSTR(t.content_data, '<.*?>', 1, s.match_occurrence) IS NOT NULL /* Only return records that have a match for the given occurrence */
ORDER BY t.content_id, s.match_occurrence
Borrowing the CONNECT_BY_LEVEL
from Barbaros' excellent answer, you could do it more concisely as:
WITH test_data AS (
SELECT 1 AS content_id, '<TITLE> <SUB-TITLE-1> Content<SUB-TITLE-2>Content.<A third sub-title>' AS content_data FROM dual UNION
SELECT 2 AS content_id, '<TITLE> <SUB-TITLE-1> Content<SUB-TITLE-2>Content.' AS content_data FROM dual
)
SELECT t.content_id,
REGEXP_SUBSTR(t.content_data, '<(.*?)>', 1, LEVEL, 'i', 1) AS content_match
FROM test_data t
CONNECT BY
LEVEL <= REGEXP_COUNT(t.content_data, '<.*?>')
AND PRIOR sys_guid() IS NOT NULL
AND PRIOR content_id = content_id
ORDER BY t.content_id, LEVEL
Note that the CONNECT_BY_LEVEL
method might be slower on large datasets, so I would avoid that if performance is a concern.
Upvotes: 3
Reputation: 6721
Perfect approach, @Josh Eller !
Only that @Gerry Gry needs it without the greater/smaller signs.
Try with grouping using parentheses?
WITH test_data(content_id,content_data) AS (
SELECT 1 , '<TITLE> <SUB-TITLE-1> Content<SUB-TITLE-2>Content.<SUB-TITLE-3>Content.' FROM dual
UNION SELECT 2 , '<TITLE> <SUB-TITLE-1> Content<SUB-TITLE-2>Content.' FROM dual
)
SELECT t.content_id
, match_occurrence
, REGEXP_SUBSTR(
t.content_data -- input string
, '[<]([^>]*)[>]' -- regex
, 1 -- starting position
, s.match_occurrence -- n-th occurrence
, '' -- regexp modifier
, 1 -- captured-subexp
) AS content_match
FROM test_data t
CROSS JOIN (
SELECT 1 FROM dual
UNION SELECT 2 FROM dual
UNION SELECT 3 FROM dual
UNION SELECT 4 FROM dual
) s(match_occurrence)
WHERE
REGEXP_SUBSTR(
t.content_data -- input string
, '[<]([^>]*)[>]' -- regex
, 1 -- starting position
, s.match_occurrence -- n-th occurrence
, '' -- regexp modifier
, 1 -- captured-subexp
)
IS NOT NULL
ORDER BY t.content_id, s.match_occurrence
;
-- out Time: First fetch (0 rows): 0.656 ms. All rows formatted: 0.667 ms
-- out content_id | match_occurrence | content_match
-- out ------------+------------------+---------------
-- out 1 | 1 | TITLE
-- out 1 | 2 | SUB-TITLE-1
-- out 1 | 3 | SUB-TITLE-2
-- out 1 | 4 | SUB-TITLE-3
-- out 2 | 1 | TITLE
-- out 2 | 2 | SUB-TITLE-1
-- out 2 | 3 | SUB-TITLE-2
-- out (7 rows)
-- out
-- out Time: First fetch (7 rows): 29.904 ms. All rows formatted: 29.947 ms
Upvotes: 0
Reputation: 65208
One option would be using instr()
and substr()
functions together within a
SELECT .. FROM ..CONNECT BY level
style query in order to repeat through counting the numbers of >
(or <
) signs within each strings :
SELECT id, substr(content,
instr(content,'<',1,level)+1,
instr(content,'>',1,level)-instr(content,'<',1,level)-1) as content
FROM tab
CONNECT BY level <= regexp_count(content,'>')
AND PRIOR sys_guid() IS NOT NULL
AND PRIOR id = id
Upvotes: 3
Reputation: 3970
I had tried the more conventional way using SUBSTR
and INSTR
With data as
Select column1,
Trim(CONTENT, '<', '>') as col2 FROM
TABLE
WITH subdata as
( Select column1,
SUBSTR(col2,0, INSTR(col2, ' '))
as s1
from
data) t1
Union
( Select t1.column1 as col1,
SUBSTR(col2, Length(t1.s1)+1
INSTR(
SUBSTR(
t1.col2, Length(t1.s1)+1,
LENGTH(col2)), ' '))) as col2
From
data) t3
Union
........ t3.... t4
From table
Upvotes: 0