Reputation: 175
I'm using a ctxsys.context index on one column to facilitate the Oracle Text full-text search feature. But having a problem when indexing numeric values that are separated by ',' or '.'.
I created the index like:
create index my_index on my_table(my_column)
indextype is ctxsys.context parameters ('SYNC (ON COMMIT)');
Then I insert four text documents:
insert into my_table (id, doc) values (1, 'FOO 300 BAR');
insert into my_table (id, doc) values (2, 'FOO 300 BAR 1,000.00');
insert into my_table (id, doc) values (3, 'FOO1FOO');
insert into my_table (id, doc) values (4, '1 FOO');
Now I would like to use the contains operator to search for 'FOO 300 BAR', '1,000.00' and a combination of both:
select score(1), id from my_table where contains(doc, 'FOO 300 BAR', 1) > 0;
select score(1), id from my_table where contains(doc, '1,000.00', 1) > 0;
select score(1), id from my_table where contains(doc, 'FOO 300 BAR 1,000.00', 1) > 0;
First one works as expected and I get both id 1 and 2 as a result. Although when I try to use 1,000.00 I get 0 rows as result.
As I read from the documentation it is using BASIC_LEXER as default. I also tried to specify the separators explicitly on the lexer and applied it to the index.
begin
ctx_ddl.create_preference('my_lex', 'BASIC_LEXER');
ctx_ddl.set_attribute('my_lex', 'numjoin', '.');
ctx_ddl.set_attribute('my_lex', 'numgroup', ',');
end;
create index my_index on my_table(doc)
indextype is ctxsys.context parameters ('SYNC (ON COMMIT) LEXER my_lex');
But I experienced the same behavior as before.
Could someone explain how Oracle Text treats numbers with separators and how I could configure the index so that separated numbers are treated as single words?
I'm using Oracle Database 11g Express Edition Release 11.2.0.2.0 - 64bit Production
Upvotes: 1
Views: 1402
Reputation: 4027
I am using Oracle 10g and I had the same issue you described with one minor difference.
I used this script:
begin
ctx_ddl.create_preference('my_lexer', 'BASIC_LEXER');
ctx_ddl.set_attribute('my_lexer', 'numjoin', '.');
ctx_ddl.set_attribute('my_lexer', 'numgroup', ',');
ctx_ddl.set_attribute('my_lexer', 'printjoins', '-/"_');
ctx_ddl.set_attribute('my_lexer', 'index_text', 'YES');
end;
And this to create the index:
Create Index my_table_TIX01 ON my_table (doc) INDEXTYPE IS CTXSYS.CONTEXT PARAMETERS('lexer my_lexer SYNC (ON COMMIT)')
This query returns 1 row in my case:
select score(1), id, doc from my_table where contains(doc, '1,000.00', 1) > 0;
SCORE(1),ID,DOC
2,4,1 FOO
The problem is how the tokens in the string you search for are parsed. Basically 1,000.00 is divided into multiple tokens: 1 then 000.00. You can experiment with this by inserting these rows:
insert into my_table (id, doc) values (5, '00 FOO');
insert into my_table (id, doc) values (6, '000.00 FOO');
commit; -- Dont' forget to commit the data you insert otherwise the index is not updated!
then run this query:
select score(1), id, doc from my_table where contains(doc, '1,000.00', 1) > 0;
It comes back with:
SCORE(1),ID,DOC
3,4,1 FOO
3,6,000.00 FOO
The comma is the same as the ACCUM. This is from the documentation (see the links at the bottom):
ACCUM , Use the ACCUM operator to search for documents that contain at least one occurrence of any of the query terms. The accumulate operator ranks documents according to the total term weight of a document. The following query returns all documents that contain the terms dogs, cats and puppies giving the highest scores to the documents that contain all three terms:
'dogs, cats, puppies'
To fix your search you need to enclose the string you search for between curly brackets:
These two queries will return the data you expect:
select score(1), id, doc from my_table where contains(doc, '{1,000.00}', 1) > 0;
select score(1), id, doc from my_table where contains(doc, 'FOO 300 BAR {1,000.00}', 1) > 0;
One more thing to be aware of is that if you search for 300 FOO, i.e. you run this query:
select score(1), id, doc from my_table where contains(doc, '300 FOO', 1) > 0;
it won't come back with anything. You either have to use relaxation templates or use AND or &: Examples:
select score(1), id, doc from my_table where contains(doc, '300 and FOO', 1) > 0;
select score(1), id, doc from my_table where contains(doc, '
<query>
<textquery lang="ENGLISH" grammar="CONTEXT">
<progression>
<seq>300 f00</seq>
<seq>300 NEAR f00</seq>
<seq>300 AND foo</seq>
</progression>
</textquery>
<score datatype="INTEGER" algorithm="COUNT"/>
</query>
', 1) > 0;
See also these links:
http://docs.oracle.com/cd/B19306_01/text.102/b14218/cqoper.htm#CCREF0300
http://docs.oracle.com/cd/B19306_01/text.102/b14217/query.htm#i1006507
http://docs.oracle.com/cd/B19306_01/text.102/b14218/csql.htm#i997503
I think we are very used to searching google and this doesn't work exactly as google does...
Upvotes: 1