Scouse_Bob
Scouse_Bob

Reputation: 546

Is there a way to search a string within a document using oracle text and return more than just the document name

I have been through the below article:

https://community.oracle.com/tech/developers/discussion/2179837/searching-a-string-in-a-pdf-blob

It describes how to upload a document such as a pdf or word doc as a blob into a database table and then using oracle text, search for a string in that document to tell you the document that the string is in.

I cannot find a means by which you can for example search for a string in a word document and then return the paragraph, or some text that includes that string within the document.

What if you uploaded a hundred word documents,and you wanted to use oracle text to search for any that included a string? Does Oracle Text let you reference not just the document but the paragraph your string is in?

Help appreciated, I can find no examples of this unfortunately.

Upvotes: 1

Views: 450

Answers (1)

davidm
davidm

Reputation: 1760

You can use the CTX_DOC package for this. I will show you a short example on how to use it.

First create the table:

CREATE TABLE x_files
(   
    id_ INTEGER,
    f_name VARCHAR2(30),
    p_file BLOB,
    FORMAT VARCHAR2(30),
    CONSTRAINT id_pk PRIMARY KEY (id_)    
);

Next insert some PDFs, DOCs, TXTs and so on, afterwards create the index as:

CREATE INDEX i_x_files ON x_files(p_file) INDEXTYPE IS CTXSYS.CONTEXT PARAMETERS 
('DATASTORE CTXSYS.DEFAULT_DATASTORE FILTER CTXSYS.AUTO_FILTER FORMAT COLUMN FORMAT');

Finally check this example PL/SQL anonymous procedure:

SET SERVEROUTPUT ON;
DECLARE
    v_query VARCHAR2(255) := 'your-query';
    t_restab ctx_doc.highlight_tab;
BEGIN

    FOR r_pdf IN (SELECT id_, p_file FROM x_files WHERE contains(p_file,v_query) > 0) LOOP

        ctx_doc.highlight -- ctx_doc.markup is also very useful
        (
            index_name  => 'i_x_files', 
            textkey     => r_pdf.id_, 
            text_query  => v_query, 
            restab      => t_restab, 
            plaintext   => FALSE
        );
        FOR I IN t_restab.FIRST..t_restab.LAST LOOP                
            dbms_output.put_line(
                utl_raw.cast_to_varchar2(
                    dbms_lob.substr(r_pdf.p_file, t_restab(I).OFFSET, t_restab(I).LENGTH)
                )
            );
        END LOOP;

        dbms_output.put_line('');

    END LOOP;
END;
/

Upvotes: 0

Related Questions