GrubBumble
GrubBumble

Reputation: 43

Look for text in one field contained in another field Oracle SQL

I apologize if this question has been asked, I'm having trouble putting it into words.

I've been asked to filter out rows in a query where text from one field is contained in another field. An example would probably explain it better:

    Column_1         Column_2
    Low Static       Static
    Static           Static
    Static           Clear
                     Static
    Very Low Freq    Freq

The result of the query should return only rows 3 and 4, since rows 1, 2, and 5 contain strings that are similar. Right now, I have the following condition:

    WHERE
    ((Column_2 NOT LIKE '%' || Column_1 || '%')
    OR (Column_1 NOT LIKE '%' || Column_2 || '%' OR Column_1 IS NULL))

However, it's returning rows 1, 3, 4, and 5 when I want to only return rows 3 and 4. This is just example data, my actual dataset contains many different text strings in columns 1 and 2, so I can't just write specific case statements to exlcude certain instances where the columns are similar.

Maybe this just isn't possible, since I'm unable to define a string as something contained within 2 spaces, while at the same time taking into consideration cases where there are no spaces?

Thanks

Upvotes: 3

Views: 19267

Answers (2)

mrjoltcola
mrjoltcola

Reputation: 20842

The approach you are going with is going to do full table scans so it wont scale as the table grows. If you want to implement a more efficient solution (without using Oracle large text indexing) that will use an index, use a function based index to pre-calculate the columns common substrings.

Using INSTR() you can find whether a column is a substring of another column, and return a score for that. 0 means no match.

create index ix_t_score on t (instr(nvl(column_1,' '), nvl(column_2, ' ')),
                              instr(nvl(column_2,' '), nvl(column_1, ' ')));

Now write the query such that it allows Oracle to use the indexes.

-- Find rows that don't have common strings
select * from t
  where instr(nvl(column_1, ' '), nvl(column_2, ' ')) = 0 and
        instr(nvl(column_2, ' '), nvl(column_1, ' ')) = 0;

-- Find rows that do
select * from t
  where instr(nvl(column_1, ' '), nvl(column_2, ' ')) > 0 or
        instr(nvl(column_2, ' '), nvl(column_1, ' ')) > 0;


set autotrace on


Execution Plan
----------------------------------------------------------
Plan hash value: 4100696360

---------------------------------------------------------------------------------
| Id  | Operation         | Name        | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |             |     1 |    22 |     2   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE   |             |     1 |    22 |            |          |
|*  2 |   INDEX RANGE SCAN| IX_T_SCORE  |     1 |    22 |     2   (0)| 00:00:01 |
---------------------------------------------------------------------------------


Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access(INSTR(NVL("COLUMN_1",' '),NVL("COLUMN_2",' '))=0 AND
              INSTR(NVL("COLUMN_2",' '),NVL("COLUMN_1",' '))=0)

You can simplify it by creating a deterministic stored procedure / function to return a score, and the SQL becomes much simpler than the above. The use of NVL() is to take care of columns with nulls.

Upvotes: 1

Gordon Linoff
Gordon Linoff

Reputation: 1269853

For your expression, I think you want and rather than or:

WHERE ((Column_2 NOT LIKE '%' || Column_1 || '%') AND
       (Column_1 NOT LIKE '%' || Column_2 || '%' OR Column_1 IS NULL)
      )

You need for both conditions to be true. You might find the logic easier to follow as:

WHERE NOT (Column_2 LIKE '%' || Column_1 || '%' OR
           Column_1 LIKE '%' || Column_2 || '%'
          )

Upvotes: 3

Related Questions