simulacrum
simulacrum

Reputation: 3

Replacing everything but a specific pattern in a string (Oracle)

I want to replace everything in a string with '' except for a given pattern using Oracle's regexp_replace.

In my case the pattern refers to German licence plates. The patterns is contained in the usage column (verwendungszweck_bez) of a revenue table (of a bank). The pattern can be matched by ([a-z]{1,3})[- ]([a-z]{1,2}) ?([0-9]{1,4}). Now I'd like to reverse the matching pattern in order to match everything except for the pattern. The usage column looks like this:

ALLIANZ VERSICHERUNGS-AG VERTRAG AS-9028000568 KFZ-VERSICHERUNG KFZ-VERS. XX-Y 427 01.01.19 - 31.12.19

XX-Y 427 would be the pattern I'm interested in. The string can contain more than one license plate:

AXA VERSICHERUNG AG 40301089910 KFZ HAFTPFLICHT ABC-RM10 37,35 + 40330601383 KFZ HAFTPFLIVHT ABC-LX 283 21,19

In this case I need ABC-RM10 and ABC-LX 283.

So far I just replace everything from the string with regexp_replace:

regexp_replace(lower(a.verwendungszweck_bez),'^(.*?)kfz','')

because there's always 'kfz' in the string and the licence plate information follows (not necessarily direct) after that.

upper(regexp_replace(regexp_substr(regexp_replace(lower(a.verwendungszweck_bez),'(^(.*?)kfz',''),'([a-z]{1,3})[- ]([a-z]{1,2}) ?([0-9]{1,4})',1,1),'([a-z]{1,3})[- ]([a-z]{1,2}) ?([0-9]{1,4})','\1-\2 \3'))

This works but I'm sure there's a better solution.

The result should be a list of customers, licence plates and count of cars like this:

Customer|licence plates         |count

1234567 |XX-Y 427| 1
1255599 |ABC-RM 10 + ABC-LX 283| 2

Upvotes: 0

Views: 475

Answers (1)

MT0
MT0

Reputation: 168751

You can use a recursive sub-query to find the items. Also, you can use UPPER and TRANSLATE to normalise the data to remove the optional separators in the number plates and convert it into a single case:

Test Data:

CREATE TABLE test_data ( value ) AS
SELECT 'ALLIANZ VERSICHERUNGS-AG VERTRAG AS-9028000568 KFZ-VERSICHERUNG KFZ-VERS. XX-Y 427 01.01.19 - 31.12.19' FROM DUAL UNION ALL
-- UNG AG 4030 should not match
SELECT 'AXA VERSICHERUNG AG 40301089910 KFZ HAFTPFLICHT ABC-RM10 37,35 + 40330601383 KFZ HAFTPFLIVHT ABC-LX 283 21,19' FROM DUAL UNION ALL
-- Multiple matches adjacent to each other
SELECT 'AA-A1BB-BB222CC C3333' FROM DUAL UNION ALL
-- Duplicate values with different separators and cases
SELECT 'AA-A1 AA-A 1 aa a1' FROM DUAL

Query:

WITH items ( value, item, next_pos ) AS (
  SELECT value,
         TRANSLATE( UPPER( REGEXP_SUBSTR( value, '([^a-z]|^)([a-z]{1,3}[- ][a-z]{1,2} ?\d{1,4})(\D|$)', 1, 1, 'i', 2 ) ), '_ -', '_' ),
         REGEXP_INSTR(  value, '([^a-z]|^)([a-z]{1,3}[- ][a-z]{1,2} ?\d{1,4})(\D|$)', 1, 1, 1, 'i', 2 ) - 1
  FROM   test_data
UNION ALL
  SELECT value,
         TRANSLATE( UPPER( REGEXP_SUBSTR( value, '([^a-z]|^)([a-z]{1,3}[- ][a-z]{1,2} ?\d{1,4})(\D|$)', next_pos, 1, 'i', 2 ) ), '_ -', '_' ),
         REGEXP_INSTR(  value, '([^a-z]|^)([a-z]{1,3}[- ][a-z]{1,2} ?\d{1,4})(\D|$)', next_pos, 1, 1, 'i', 2 ) - 1
  FROM   items
  WHERE  next_pos > 0
)
SELECT item,
       COUNT(*)
FROM   items
WHERE  item IS NOT NULL AND NEXT_POS > 0
GROUP BY item

Output:

ITEM     | COUNT(*)
:------- | -------:
CCC3333  |        1
AAA1     |        4
XXY427   |        1
ABCRM10  |        1
ABCLX283 |        1
BBBB222  |        1

db<>fiddle here

The result should be a list of customers ...

You haven't given any information on how customers relate to this; that part is left as an exercise to the reader (who hopefully has the client values somewhere and can correlate them to the input).


Update:

If you want the count of unique number plates per row then:

WITH items ( rid, value, item, next_pos ) AS (
  SELECT ROWID,
         value,
         TRANSLATE( UPPER( REGEXP_SUBSTR( value, '([^a-z]|^)([a-z]{1,3}[- ][a-z]{1,2} ?\d{1,4})(\D|$)', 1, 1, 'i', 2 ) ), '_ -', '_' ),
         REGEXP_INSTR(  value, '([^a-z]|^)([a-z]{1,3}[- ][a-z]{1,2} ?\d{1,4})(\D|$)', 1, 1, 1, 'i', 2 ) - 1
  FROM   test_data
UNION ALL
  SELECT rid,
         value,
         TRANSLATE( UPPER( REGEXP_SUBSTR( value, '([^a-z]|^)([a-z]{1,3}[- ][a-z]{1,2} ?\d{1,4})(\D|$)', next_pos, 1, 'i', 2 ) ), '_ -', '_' ),
         REGEXP_INSTR(  value, '([^a-z]|^)([a-z]{1,3}[- ][a-z]{1,2} ?\d{1,4})(\D|$)', next_pos, 1, 1, 'i', 2 ) - 1
  FROM   items
  WHERE  next_pos > 0
)
SELECT LISTAGG( item, ' + ' ) WITHIN GROUP ( ORDER BY item ) AS items,
       COUNT(*)
FROM   (
  SELECT DISTINCT
         rid,
         item
  FROM   items
  WHERE  item IS NOT NULL AND NEXT_POS > 0
)
GROUP BY rid;

Which outputs:

ITEMS                    | COUNT(*)
:----------------------- | -------:
XXY427                   |        1
ABCLX283 + ABCRM10       |        2
AAA1 + BBBB222 + CCC3333 |        3
AAA1                     |        1

db<>fiddle here

Upvotes: 1

Related Questions