Ruslan
Ruslan

Reputation: 10147

Word Match Search with PL/SQL

what would be the best way to do a word match search in PL/SQL?

E.g. for the string "BROUGHTONS OF CHELTENHAM LIMITED"

"BROUGHTONS LIMITED" is a match

"OF LIMITED" is a match

"CHELTENHAM BROUGHTONS" is a match

"BROUG" is a non-match

Upvotes: 0

Views: 1799

Answers (2)

Jeffrey Kemp
Jeffrey Kemp

Reputation: 60262

Use an Oracle Text index. This will allow you to issue powerful CONTAINS queries.

http://docs.oracle.com/cd/B28359_01/text.111/b28303/quicktour.htm

Upvotes: 0

tbone
tbone

Reputation: 15473

Here's a rather crude approach, but should do what you are asking. As Xophmeister noted, you probably need to tokenize each string and then search the tokens (since you want to match out of order, doing a simple "like %tokenA%tokenB%tokenC%" won't work).

Also, this doesn't even touch all the issues around phonetics, soundex, etc. But again, not what you asked. This also doesn't touch performance or scaling issues, and would probably only be acceptable for a small set of data.

So, first we need a split function:

create or replace 
function fn_split(i_string in varchar2, i_delimiter in varchar2 default ',', b_dedup_tokens in number default 0)
return sys.dbms_debug_vc2coll
as
  l_tab sys.dbms_debug_vc2coll;
begin
  select regexp_substr(i_string,'[^' || i_delimiter || ']+', 1, level)
  bulk collect into l_tab
  from dual
  connect by regexp_substr(i_string, '[^' || i_delimiter || ']+', 1, level) is not null
  order by level;

  if (b_dedup_tokens > 0) then
    return l_tab multiset union distinct l_tab;
  end if;
  return l_tab;
end;

Now we can use it to inspect strings for specific tokens. Here I'm searching for 3 tokens (John Q Public) from a sample set of data

with test_data as (
  select 1 as id, 'John Q Public' as full_name from dual
  union
  select 2 as id, 'John John Smith' as full_name from dual
  union
  select 3 as id,'Sally Smith' from dual
  union
  select 4 as id, 'Mr John B B Q Public' from dual
  union
  select 5 as id, 'A Public John' from dual
)
select d.id, d.full_name, count(1) as hits
from test_data d, table(fn_split(full_name, ' ', 1))
-- should have at least 1 of these tokens
where column_value in ('John', 'Q', 'Public')
group by d.id, d.full_name
-- can also restrict results to those with at least x token hits
having count(1) >= 2
-- most hits at top of results
order by count(1) desc, id asc

Output:

"ID"    "FULL_NAME" "HITS"
1   "John Q Public" 3
4   "Mr John B B Q Public"  3
5   "A Public John" 2

You can also add "upper" to make case insensitive, etc.

Upvotes: 4

Related Questions