Reputation: 10147
what would be the best way to do a word match search in PL/SQL?
E.g. for the string "BROUGHTONS OF CHELTENHAM LIMITED"
"BROUGHTONS LIMITED" is a match
"OF LIMITED" is a match
"CHELTENHAM BROUGHTONS" is a match
"BROUG" is a non-match
Upvotes: 0
Views: 1799
Reputation: 60262
Use an Oracle Text index. This will allow you to issue powerful CONTAINS queries.
http://docs.oracle.com/cd/B28359_01/text.111/b28303/quicktour.htm
Upvotes: 0
Reputation: 15473
Here's a rather crude approach, but should do what you are asking. As Xophmeister noted, you probably need to tokenize each string and then search the tokens (since you want to match out of order, doing a simple "like %tokenA%tokenB%tokenC%" won't work).
Also, this doesn't even touch all the issues around phonetics, soundex, etc. But again, not what you asked. This also doesn't touch performance or scaling issues, and would probably only be acceptable for a small set of data.
So, first we need a split function:
create or replace
function fn_split(i_string in varchar2, i_delimiter in varchar2 default ',', b_dedup_tokens in number default 0)
return sys.dbms_debug_vc2coll
as
l_tab sys.dbms_debug_vc2coll;
begin
select regexp_substr(i_string,'[^' || i_delimiter || ']+', 1, level)
bulk collect into l_tab
from dual
connect by regexp_substr(i_string, '[^' || i_delimiter || ']+', 1, level) is not null
order by level;
if (b_dedup_tokens > 0) then
return l_tab multiset union distinct l_tab;
end if;
return l_tab;
end;
Now we can use it to inspect strings for specific tokens. Here I'm searching for 3 tokens (John Q Public) from a sample set of data
with test_data as (
select 1 as id, 'John Q Public' as full_name from dual
union
select 2 as id, 'John John Smith' as full_name from dual
union
select 3 as id,'Sally Smith' from dual
union
select 4 as id, 'Mr John B B Q Public' from dual
union
select 5 as id, 'A Public John' from dual
)
select d.id, d.full_name, count(1) as hits
from test_data d, table(fn_split(full_name, ' ', 1))
-- should have at least 1 of these tokens
where column_value in ('John', 'Q', 'Public')
group by d.id, d.full_name
-- can also restrict results to those with at least x token hits
having count(1) >= 2
-- most hits at top of results
order by count(1) desc, id asc
Output:
"ID" "FULL_NAME" "HITS"
1 "John Q Public" 3
4 "Mr John B B Q Public" 3
5 "A Public John" 2
You can also add "upper" to make case insensitive, etc.
Upvotes: 4