Reputation: 28316
I have the following raw data:
1.1.2.2.4.4.4.5.5.9.11.15.16.16.19 ...
I'm using this regex to remove duplicates:
([^.]+)(.[ ]*\1)+
which results in the following:
1.2.4.5.9.115.16.19 ...
The problem is how the regex handles 1.1
in the substring .11.15.
What should be 9.11.15.16 becomes 9.115.16. How do I fix this?
The raw values are sorted in numeric order to accommodate the regex used for processing the duplicate values.
The regex is being used within Oracle's REGEXP_REPLACE
The decimal is a delimiter. I've tried commas and pipes but that doesn't fix the problem.
Upvotes: 1
Views: 112
Reputation: 4660
Group the repeating pattern and remove it
As revo has indicated, a big source of your difficulties came with not escaping the period. In addition, the resulting string having a 115
included can be explained as follows (Valdi_Bo made a similar observation earlier):
([^.]+)(.[ ]*\1)+
will match 11.15
as follow:
SCOTT@DB>SELECT
2 '11.15' val,
3 regexp_replace('11.15','([^.]+)(\.[ ]*\1)+','\1') deduplicated
4 FROM
5 dual;
VAL DEDUPLICATED
11.15 115
Here is a similar approach to address those problems:
matching pattern composition
-Look for a non-period matching list of length 0 to N (subexpression is referenced by \1
).
'19' which matches ([^.]*)
-Look for the repeats which form our second matching list associated with subexression 2, referenced by \2
.
'19.19.19' which matches ([^.]*)([.]\1)+
-Look for either a period or end of string. This is matching list referenced by \3
. This fixes the match of '11.15' by '115'.
([.]|$)
replacement string
I replace the match pattern with a replacement string composed of the first instance of the non-period matching list.
\1\3
Solution
regexp_replace(val,'([^.]*)([.]\1)+([.]|$)','\1\3')
Here is an example using some permutations of your examples:
SCOTT@db>WITH tst AS (
2 SELECT
3 '1.1.2.2.4.4.4.5.5.9.11.15.16.16.19' val
4 FROM
5 dual
6 UNION ALL
7 SELECT
8 '1.1.1.1.2.2.4.4.4.4.4.5.5.9.11.11.11.15.16.16.19' val
9 FROM
10 dual
11 UNION ALL
12 SELECT
13 '1.1.2.2.4.4.4.5.5.9.11.15.16.16.19.19.19' val
14 FROM
15 dual
16 ) SELECT
17 val,
18 regexp_replace(val,'([^.]*)([.]\1)+([.]|$)','\1\3') deduplicate
19 FROM
20 tst;
VAL DEDUPLICATE
------------------------------------------------------------------------
1.1.2.2.4.4.4.5.5.9.11.15.16.16.19 1.2.4.5.9.11.15.16.19
1.1.1.1.2.2.4.4.4.4.4.5.5.9.11.11.11.15.16.16.19 1.2.4.5.9.11.15.16.19
1.1.2.2.4.4.4.5.5.9.11.15.16.16.19.19.19 1.2.4.5.9.11.15.16.19
My approach does not address possible spaces in the string. One could just remove them separately (e.g. through a separate replace statement).
Upvotes: 0
Reputation: 30991
Your regex caught:
1
- the second digit in 11
,1
- the first digit in 15
.So your regex failed to catch the whole sequence of digits.
The most natural way to write a regex catching the whole sequence of digits would be to use:
But as I am not sure whether Oracle supports lookarounds, I wrote the regex another way:
(^|\.)(\d+)(\.(\2))+
Details:
(^|\.)
- Either start of the string or a dot (group 1), instead of
the loobehind.(\d+)
- A sequence of digits (group 2).(
- Start of group 3, containing:\.(\2)
- A dot and the same sequence of digits which caught group 2.)+
- End of group 3, it may occur multiple times.Upvotes: 1
Reputation: 48731
Unfortunately Oracle doesn't provide a token to match a word boundary position. Neither familiar \b
token nor ancient [[:<:]]
or [[:>:]]
.
But on this specific set you can use:
(\d+\.)(\1)+
Note: You forgot to escape dot.
Upvotes: 1
Reputation: 31666
Oracle's REGEX
does not work the way you intended. You could split the string and find distinct rows using the general method Splitting string into multiple rows in Oracle. Another option is to use XMLTABLE
, which works for numbers and also strings with proper quoting.
SELECT LISTAGG(n, '.') WITHIN
GROUP (
ORDER BY n
) AS n
FROM (
SELECT DISTINCT TO_NUMBER(column_value) AS n
FROM XMLTABLE(replace('1.1.2.2.4.4.4.5.5.9.11.15.16.16.19', '.', ','))
);
Upvotes: 3