Reputation: 43
I apologize if this question has been asked, I'm having trouble putting it into words.
I've been asked to filter out rows in a query where text from one field is contained in another field. An example would probably explain it better:
Column_1 Column_2
Low Static Static
Static Static
Static Clear
Static
Very Low Freq Freq
The result of the query should return only rows 3 and 4, since rows 1, 2, and 5 contain strings that are similar. Right now, I have the following condition:
WHERE
((Column_2 NOT LIKE '%' || Column_1 || '%')
OR (Column_1 NOT LIKE '%' || Column_2 || '%' OR Column_1 IS NULL))
However, it's returning rows 1, 3, 4, and 5 when I want to only return rows 3 and 4. This is just example data, my actual dataset contains many different text strings in columns 1 and 2, so I can't just write specific case statements to exlcude certain instances where the columns are similar.
Maybe this just isn't possible, since I'm unable to define a string as something contained within 2 spaces, while at the same time taking into consideration cases where there are no spaces?
Thanks
Upvotes: 3
Views: 19267
Reputation: 20842
The approach you are going with is going to do full table scans so it wont scale as the table grows. If you want to implement a more efficient solution (without using Oracle large text indexing) that will use an index, use a function based index to pre-calculate the columns common substrings.
Using INSTR() you can find whether a column is a substring of another column, and return a score for that. 0 means no match.
create index ix_t_score on t (instr(nvl(column_1,' '), nvl(column_2, ' ')),
instr(nvl(column_2,' '), nvl(column_1, ' ')));
Now write the query such that it allows Oracle to use the indexes.
-- Find rows that don't have common strings
select * from t
where instr(nvl(column_1, ' '), nvl(column_2, ' ')) = 0 and
instr(nvl(column_2, ' '), nvl(column_1, ' ')) = 0;
-- Find rows that do
select * from t
where instr(nvl(column_1, ' '), nvl(column_2, ' ')) > 0 or
instr(nvl(column_2, ' '), nvl(column_1, ' ')) > 0;
set autotrace on
Execution Plan
----------------------------------------------------------
Plan hash value: 4100696360
---------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 22 | 2 (0)| 00:00:01 |
| 1 | SORT AGGREGATE | | 1 | 22 | | |
|* 2 | INDEX RANGE SCAN| IX_T_SCORE | 1 | 22 | 2 (0)| 00:00:01 |
---------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - access(INSTR(NVL("COLUMN_1",' '),NVL("COLUMN_2",' '))=0 AND
INSTR(NVL("COLUMN_2",' '),NVL("COLUMN_1",' '))=0)
You can simplify it by creating a deterministic stored procedure / function to return a score, and the SQL becomes much simpler than the above. The use of NVL() is to take care of columns with nulls.
Upvotes: 1
Reputation: 1269853
For your expression, I think you want and
rather than or
:
WHERE ((Column_2 NOT LIKE '%' || Column_1 || '%') AND
(Column_1 NOT LIKE '%' || Column_2 || '%' OR Column_1 IS NULL)
)
You need for both conditions to be true. You might find the logic easier to follow as:
WHERE NOT (Column_2 LIKE '%' || Column_1 || '%' OR
Column_1 LIKE '%' || Column_2 || '%'
)
Upvotes: 3