Reputation: 3
I am trying to work out in Oracle how to isolate/highlight word combinations in a concatenated string like the one below:
Some words##Again words##More of this||@@@@||Some words##Again words##Other
The idea is to find the word combinations that appear exactly twice and replace them by 0 so I'm left with the ones that appear only once, either on the left side of the ||@@@@||
or on the right side. The result of the query should be something like this:
Highlighted
Some words##Again words##More of this||@@@@||Some words##Again words##**Other**
Replaced
0##0##More of this||@@@@||0##0##Other
To give you some more information about the concatenation: the left side (before the ||@@@@||
) is my current customer record, while on the right hand side I have the previous version. By making the replacements I can reveal any differences between customer records.
I have tried to get this done by using:
REGEXP_REPLACE(MY STRING,'((Some words){1,2})|((Again words){1,2})','0',1,0)
as for some reason the string parts in my first record are never correctly replaced. I'm also hitting the limits of this function due to the number of word combinations I need to match;Some more information in order to find a solid, performant solution:
I hope this is clear. Anyone with some magical ideas?
Thanks in advance
Upvotes: 0
Views: 181
Reputation: 168741
You can use a recursive sub-query factoring clause to replace one duplicated term at each iteration:
WITH replaced ( value, start_char ) AS (
SELECT REGEXP_REPLACE(
value,
'(##|^)([^#]+?)((##[^#]+?)*\|\|@@@@\|\|([^#]+?##)*)\2(##|$)',
'\10\30\6',
1
),
REGEXP_INSTR(
value,
'(##|^)([^#]+?)((##[^#]+?)*\|\|@@@@\|\|([^#]+?##)*)\2(##|$)',
1
)
FROM table_name
UNION ALL
SELECT REGEXP_REPLACE(
value,
'(##|^)([^#]+?)((##[^#]+?)*\|\|@@@@\|\|([^#]+?##)*)\2(##|$)',
'\10\30\6',
start_char + 1
),
REGEXP_INSTR(
value,
'(##|^)([^#]+?)((##[^#]+?)*\|\|@@@@\|\|([^#]+?##)*)\2(##|$)',
start_char + 1
)
FROM replaced
WHERE start_char > 0
)
SELECT value
FROM replaced
WHERE start_char = 0;
Which, for the sample data:
CREATE TABLE table_name ( value ) AS
SELECT 'Some words##Again words##More of this||@@@@||Some words##Again words##Other' FROM DUAL UNION ALL
SELECT '333##123##789##555||@@@@||123##456##789##222##333' FROM DUAL;
Outputs:
| VALUE | | :------------------------------------ | | 0##0##More of this||@@@@||0##0##Other | | 0##0##0##555||@@@@||0##456##0##222##0 |
db<>fiddle here
The regular expression matches:
(##|^)
either two #
characters or the start of the string ^
(in the first capturing group ()
);([^#]+?)
one-or-more characters that are not #
(in the second capturning group ()
);(
the start of the 3rd capturing group;
(##[^#]+?)*
two #
characters followed by one-or-more non-#
characters (in the 4th capturing group ()
) all repeated zero-or-more *
times;\|\|@@@@\|\|
then two |
characters, four @
characters and two |
characters;([^#]+?##)*
then one-of-more non-#
characters followed by two #
characters (in the 5th capturing group ()
);)
the end of the 3rd capturing group;\2
a duplicate of the 2nd capturing group; then(##|$)
either two #
characters or the end-of-the-string $
(in the 6th capturing group).This is replaced by:
\10\30\6
which is the contents of the 1st capturing group then a zero (replacing the 2nd capturing group) then the contents of the 3rd capturing group then a second zero (replacing the matched duplicate) then the contents of the 6th capturing group.The query will replace a pair of duplicate terms in the string (if they exist) and REGEXP_INSTR
will find the start of the match and put the values into value
and start_char
(respectively); then at the next iteration the regular expression will start looking from the next character on from the start of the previous match, so that it will gradually move across the string finding matches until no more duplicate terms can be found and REGEXP_REPLACE
will not perform a replacement and REGEXP_INSTR
will return 0
and the iteration will terminate.
The final query filters to return the only the final level of the iteration (when all the duplicates have been replaced).
Upvotes: 1