Reputation: 3
I want to replace everything in a string with '' except for a given pattern using Oracle's regexp_replace
.
In my case the pattern refers to German licence plates. The patterns is contained in the usage column (verwendungszweck_bez
) of a revenue table (of a bank). The pattern can be matched by ([a-z]{1,3})[- ]([a-z]{1,2}) ?([0-9]{1,4}
). Now I'd like to reverse the matching pattern in order to match everything except for the pattern.
The usage column looks like this:
ALLIANZ VERSICHERUNGS-AG VERTRAG AS-9028000568 KFZ-VERSICHERUNG KFZ-VERS. XX-Y 427 01.01.19 - 31.12.19
XX-Y 427
would be the pattern I'm interested in. The string can contain more than one license plate:
AXA VERSICHERUNG AG 40301089910 KFZ HAFTPFLICHT ABC-RM10 37,35 + 40330601383 KFZ HAFTPFLIVHT ABC-LX 283 21,19
In this case I need ABC-RM10
and ABC-LX 283
.
So far I just replace everything from the string with regexp_replace
:
regexp_replace(lower(a.verwendungszweck_bez),'^(.*?)kfz','')
because there's always 'kfz
' in the string and the licence plate information follows (not necessarily direct) after that.
upper(regexp_replace(regexp_substr(regexp_replace(lower(a.verwendungszweck_bez),'(^(.*?)kfz',''),'([a-z]{1,3})[- ]([a-z]{1,2}) ?([0-9]{1,4})',1,1),'([a-z]{1,3})[- ]([a-z]{1,2}) ?([0-9]{1,4})','\1-\2 \3'))
This works but I'm sure there's a better solution.
The result should be a list of customers, licence plates and count of cars like this:
Customer|licence plates |count
1234567 |XX-Y 427| 1
1255599 |ABC-RM 10 + ABC-LX 283| 2
Upvotes: 0
Views: 475
Reputation: 168751
You can use a recursive sub-query to find the items. Also, you can use UPPER
and TRANSLATE
to normalise the data to remove the optional separators in the number plates and convert it into a single case:
Test Data:
CREATE TABLE test_data ( value ) AS
SELECT 'ALLIANZ VERSICHERUNGS-AG VERTRAG AS-9028000568 KFZ-VERSICHERUNG KFZ-VERS. XX-Y 427 01.01.19 - 31.12.19' FROM DUAL UNION ALL
-- UNG AG 4030 should not match
SELECT 'AXA VERSICHERUNG AG 40301089910 KFZ HAFTPFLICHT ABC-RM10 37,35 + 40330601383 KFZ HAFTPFLIVHT ABC-LX 283 21,19' FROM DUAL UNION ALL
-- Multiple matches adjacent to each other
SELECT 'AA-A1BB-BB222CC C3333' FROM DUAL UNION ALL
-- Duplicate values with different separators and cases
SELECT 'AA-A1 AA-A 1 aa a1' FROM DUAL
Query:
WITH items ( value, item, next_pos ) AS (
SELECT value,
TRANSLATE( UPPER( REGEXP_SUBSTR( value, '([^a-z]|^)([a-z]{1,3}[- ][a-z]{1,2} ?\d{1,4})(\D|$)', 1, 1, 'i', 2 ) ), '_ -', '_' ),
REGEXP_INSTR( value, '([^a-z]|^)([a-z]{1,3}[- ][a-z]{1,2} ?\d{1,4})(\D|$)', 1, 1, 1, 'i', 2 ) - 1
FROM test_data
UNION ALL
SELECT value,
TRANSLATE( UPPER( REGEXP_SUBSTR( value, '([^a-z]|^)([a-z]{1,3}[- ][a-z]{1,2} ?\d{1,4})(\D|$)', next_pos, 1, 'i', 2 ) ), '_ -', '_' ),
REGEXP_INSTR( value, '([^a-z]|^)([a-z]{1,3}[- ][a-z]{1,2} ?\d{1,4})(\D|$)', next_pos, 1, 1, 'i', 2 ) - 1
FROM items
WHERE next_pos > 0
)
SELECT item,
COUNT(*)
FROM items
WHERE item IS NOT NULL AND NEXT_POS > 0
GROUP BY item
Output:
ITEM | COUNT(*) :------- | -------: CCC3333 | 1 AAA1 | 4 XXY427 | 1 ABCRM10 | 1 ABCLX283 | 1 BBBB222 | 1
db<>fiddle here
The result should be a list of customers ...
You haven't given any information on how customers relate to this; that part is left as an exercise to the reader (who hopefully has the client values somewhere and can correlate them to the input).
Update:
If you want the count of unique number plates per row then:
WITH items ( rid, value, item, next_pos ) AS (
SELECT ROWID,
value,
TRANSLATE( UPPER( REGEXP_SUBSTR( value, '([^a-z]|^)([a-z]{1,3}[- ][a-z]{1,2} ?\d{1,4})(\D|$)', 1, 1, 'i', 2 ) ), '_ -', '_' ),
REGEXP_INSTR( value, '([^a-z]|^)([a-z]{1,3}[- ][a-z]{1,2} ?\d{1,4})(\D|$)', 1, 1, 1, 'i', 2 ) - 1
FROM test_data
UNION ALL
SELECT rid,
value,
TRANSLATE( UPPER( REGEXP_SUBSTR( value, '([^a-z]|^)([a-z]{1,3}[- ][a-z]{1,2} ?\d{1,4})(\D|$)', next_pos, 1, 'i', 2 ) ), '_ -', '_' ),
REGEXP_INSTR( value, '([^a-z]|^)([a-z]{1,3}[- ][a-z]{1,2} ?\d{1,4})(\D|$)', next_pos, 1, 1, 'i', 2 ) - 1
FROM items
WHERE next_pos > 0
)
SELECT LISTAGG( item, ' + ' ) WITHIN GROUP ( ORDER BY item ) AS items,
COUNT(*)
FROM (
SELECT DISTINCT
rid,
item
FROM items
WHERE item IS NOT NULL AND NEXT_POS > 0
)
GROUP BY rid;
Which outputs:
ITEMS | COUNT(*) :----------------------- | -------: XXY427 | 1 ABCLX283 + ABCRM10 | 2 AAA1 + BBBB222 + CCC3333 | 3 AAA1 | 1
db<>fiddle here
Upvotes: 1