DenaliHardtail
DenaliHardtail

Reputation: 28316

Why is this regex performing partial matches?

I have the following raw data:

1.1.2.2.4.4.4.5.5.9.11.15.16.16.19 ...

I'm using this regex to remove duplicates:

([^.]+)(.[ ]*\1)+

which results in the following:

1.2.4.5.9.115.16.19 ...

The problem is how the regex handles 1.1 in the substring .11.15. What should be 9.11.15.16 becomes 9.115.16. How do I fix this?

The raw values are sorted in numeric order to accommodate the regex used for processing the duplicate values.

The regex is being used within Oracle's REGEXP_REPLACE

The decimal is a delimiter. I've tried commas and pipes but that doesn't fix the problem.

Upvotes: 1

Views: 112

Answers (4)

Patrick Bacon
Patrick Bacon

Reputation: 4660

Group the repeating pattern and remove it

As revo has indicated, a big source of your difficulties came with not escaping the period. In addition, the resulting string having a 115 included can be explained as follows (Valdi_Bo made a similar observation earlier):

([^.]+)(.[ ]*\1)+ will match 11.15 as follow:

SCOTT@DB>SELECT
  2      '11.15' val,
  3      regexp_replace('11.15','([^.]+)(\.[ ]*\1)+','\1') deduplicated
  4  FROM
  5      dual;
VAL     DEDUPLICATED
11.15   115

Here is a similar approach to address those problems:

matching pattern composition

-Look for a non-period matching list of length 0 to N (subexpression is referenced by \1).

'19' which matches ([^.]*)

-Look for the repeats which form our second matching list associated with subexression 2, referenced by \2.

'19.19.19' which matches ([^.]*)([.]\1)+

-Look for either a period or end of string. This is matching list referenced by \3. This fixes the match of '11.15' by '115'.

([.]|$)

replacement string

I replace the match pattern with a replacement string composed of the first instance of the non-period matching list.

\1\3

Solution

regexp_replace(val,'([^.]*)([.]\1)+([.]|$)','\1\3')

Here is an example using some permutations of your examples:

SCOTT@db>WITH tst AS (
  2      SELECT
  3          '1.1.2.2.4.4.4.5.5.9.11.15.16.16.19' val
  4      FROM
  5          dual
  6      UNION ALL
  7      SELECT
  8          '1.1.1.1.2.2.4.4.4.4.4.5.5.9.11.11.11.15.16.16.19' val
  9      FROM
 10          dual
 11      UNION ALL
 12      SELECT
 13          '1.1.2.2.4.4.4.5.5.9.11.15.16.16.19.19.19' val
 14      FROM
 15          dual
 16  ) SELECT
 17      val,
 18      regexp_replace(val,'([^.]*)([.]\1)+([.]|$)','\1\3') deduplicate 
 19      FROM
 20      tst;
VAL                                                DEDUPLICATE             
------------------------------------------------------------------------
1.1.2.2.4.4.4.5.5.9.11.15.16.16.19                 1.2.4.5.9.11.15.16.19   
1.1.1.1.2.2.4.4.4.4.4.5.5.9.11.11.11.15.16.16.19   1.2.4.5.9.11.15.16.19   
1.1.2.2.4.4.4.5.5.9.11.15.16.16.19.19.19           1.2.4.5.9.11.15.16.19   

My approach does not address possible spaces in the string. One could just remove them separately (e.g. through a separate replace statement).

Upvotes: 0

Valdi_Bo
Valdi_Bo

Reputation: 30991

Your regex caught:

  • a 1 - the second digit in 11,
  • then a dot,
  • and finally 1 - the first digit in 15.

So your regex failed to catch the whole sequence of digits.

The most natural way to write a regex catching the whole sequence of digits would be to use:

  • a loobehind for either the start of the string or a dot,
  • then catch a sequence of digits,
  • and finally a lookahead for a dot.

But as I am not sure whether Oracle supports lookarounds, I wrote the regex another way:

(^|\.)(\d+)(\.(\2))+

Details:

  • (^|\.) - Either start of the string or a dot (group 1), instead of the loobehind.
  • (\d+) - A sequence of digits (group 2).
  • ( - Start of group 3, containing:
  • \.(\2) - A dot and the same sequence of digits which caught group 2.
  • )+ - End of group 3, it may occur multiple times.

Upvotes: 1

revo
revo

Reputation: 48731

Unfortunately Oracle doesn't provide a token to match a word boundary position. Neither familiar \b token nor ancient [[:<:]] or [[:>:]].

But on this specific set you can use:

(\d+\.)(\1)+

Note: You forgot to escape dot.

Upvotes: 1

Kaushik Nayak
Kaushik Nayak

Reputation: 31666

Oracle's REGEX does not work the way you intended. You could split the string and find distinct rows using the general method Splitting string into multiple rows in Oracle. Another option is to use XMLTABLE , which works for numbers and also strings with proper quoting.

SELECT LISTAGG(n, '.') WITHIN
GROUP (
        ORDER BY n
        ) AS n
FROM (
    SELECT DISTINCT TO_NUMBER(column_value) AS n
    FROM XMLTABLE(replace('1.1.2.2.4.4.4.5.5.9.11.15.16.16.19', '.', ','))
    );

Demo

Upvotes: 3

Related Questions