Reputation: 937
I have a table with three columns structured as followed:
+------------------------+------------------------------+--------------+
| left | right | pattern |
+------------------------+------------------------------+--------------+
| Kiki Cola 50 ml bottle | Kiki Cola 50 ml bottle | |
+------------------------+------------------------------+--------------+
| Kiki Cola 50 ml bottle | 50 ml Kiki Cola bottle | |
+------------------------+------------------------------+--------------+
| Kiki Cola 50 ml bottle | Kiki Cola 50 ml | |
+------------------------+------------------------------+--------------+
| Kiki Cola 50 ml bottle | Kiki Cola Light bottle 50 ml | |
+------------------------+------------------------------+--------------+
| Kiki Cola 50 ml bottle | Coca Cola 50 ml bottle | |
+------------------------+------------------------------+--------------+
Now I would like to perform an Oracle-SQL-query which gives me the edit-pattern of the two strings left and right. The result should be as followed:
+------------------------+------------------------------+--------------+
| left | right | pattern |
+------------------------+------------------------------+--------------+
| Kiki Cola 50 ml bottle | Kiki Cola 50 ml bottle | SAME |
+------------------------+------------------------------+--------------+
| Kiki Cola 50 ml bottle | 50 ml Kiki Cola bottle | SWAPPED |
+------------------------+------------------------------+--------------+
| Kiki Cola 50 ml bottle | Kiki Cola 50 ml | CONTAINED_IN |
+------------------------+------------------------------+--------------+
| Kiki Cola 50 ml bottle | Kiki Cola Light bottle 50 ml | CONTAINS |
+------------------------+------------------------------+--------------+
| Kiki Cola 50 ml bottle | Coca Cola 50 ml bottle | NOT_SAME |
+------------------------+------------------------------+--------------+
Al my tries with REGEX_SPLIT and CONNECT BY were not successful. Do you have any ideas how to solve that problem?
Upvotes: 0
Views: 718
Reputation: 14848
with t( lt, rt) as (
select 'Kiki Cola 50 ml bottle', 'Kiki Cola 50 ml bottle' from dual union all
select 'Kiki Cola 50 ml bottle', '50 ml Kiki Cola bottle' from dual union all
select 'Kiki Cola 50 ml bottle', 'Kiki Cola 50 ml' from dual union all
select 'Kiki Cola 50 ml bottle', 'Kiki Cola Light bottle 50 ml' from dual union all
select 'Kiki Cola 50 ml bottle', 'Coca Cola 50 ml bottle' from dual ),
q as (select rownum rn, lt, rt,
'"'||replace(lt, ' ', '", "')||'"' ltx,
'"'||replace(rt, ' ', '", "')||'"' rtx from t )
select rn, lt, rt,
case when lt = rt then 'same'
when fl = 0 and fr = 0 then 'swapped'
when fl = 1 and fr = 0 then 'contains'
when fl = 0 and fr = 1 then 'contained in'
else 'not same'
end pattern
from (
select coalesce(l.rn, r.rn) rn,
max(case when l.rn is null then 1 else 0 end) fl,
max(case when r.rn is null then 1 else 0 end) fr
from (select rn, trim(column_value) lw from q, xmltable(ltx)) l
full join (select rn, trim(column_value) rw from q, xmltable(rtx)) r
on l.rn = r.rn and l.lw = r.rw
group by coalesce(l.rn, r.rn))
join q using (rn)
Result:
RN LT RT PATTERN
------ ---------------------- ---------------------------- ------------
1 Kiki Cola 50 ml bottle Kiki Cola 50 ml bottle same
2 Kiki Cola 50 ml bottle 50 ml Kiki Cola bottle swapped
3 Kiki Cola 50 ml bottle Kiki Cola 50 ml contained in
4 Kiki Cola 50 ml bottle Kiki Cola Light bottle 50 ml contains
5 Kiki Cola 50 ml bottle Coca Cola 50 ml bottle not same
Split string into words (here xml-way, connect by
works too, or function), make comparison using full join, count nulls, group by and show pattern using case when
.
Upvotes: 1
Reputation: 167962
You can create a collection data type:
CREATE TYPE stringlist IS TABLE OF VARCHAR2(200);
And then split the strings into collections of words and compare the collections:
SELECT left,
right,
CASE
WHEN left = right THEN 'same'
WHEN left_words = right_words THEN 'swapped'
WHEN left_words SUBMULTISET OF right_words THEN 'contains'
WHEN right_words SUBMULTISET OF left_words THEN 'contained in'
ELSE 'not_same'
END AS pattern
FROM (
SELECT left,
right,
( SELECT CAST(
COLLECT( REGEXP_SUBSTR( left, '[^ ]+', 1, LEVEL ) )
AS stringlist
)
FROM DUAL
CONNECT BY
LEVEL <= REGEXP_COUNT( left, '[^ ]+' )
) AS left_words,
( SELECT CAST(
COLLECT( REGEXP_SUBSTR( right, '[^ ]+', 1, LEVEL ) )
AS stringlist
)
FROM DUAL
CONNECT BY
LEVEL <= REGEXP_COUNT( right, '[^ ]+' )
) AS right_words
FROM test_data t
)
So for your test data:
CREATE TABLE test_data ( left, right ) AS
SELECT 'Kiki Cola 50 ml bottle', 'Kiki Cola 50 ml bottle' FROM DUAL UNION ALL
SELECT 'Kiki Cola 50 ml bottle', '50 ml Kiki Cola bottle' FROM DUAL UNION ALL
SELECT 'Kiki Cola 50 ml bottle', 'Kiki Cola 50 ml' FROM DUAL UNION ALL
SELECT 'Kiki Cola 50 ml bottle', 'Kiki Cola Light 50 ml bottle' FROM DUAL UNION ALL
SELECT 'Kiki Cola 50 ml bottle', 'Coca Cola 50 ml bottle' FROM DUAL;
The query outputs:
LEFT | RIGHT | PATTERN :--------------------- | :--------------------------- | :----------- Kiki Cola 50 ml bottle | Kiki Cola 50 ml bottle | same Kiki Cola 50 ml bottle | 50 ml Kiki Cola bottle | swapped Kiki Cola 50 ml bottle | Kiki Cola 50 ml | contained in Kiki Cola 50 ml bottle | Kiki Cola Light 50 ml bottle | contains Kiki Cola 50 ml bottle | Coca Cola 50 ml bottle | not_same
db<>fiddle here
Upvotes: 2