Ivan Klemenko

Solr Data Import from S3 with Base64 File Retrieval: How to Configure Solr to Access File Data via UUID and File Type

Previously, files were stored in the database, and indexing in Solr was done as follows:

<?xml version="1.0" encoding="UTF-8" ?>
<dataConfig>
    <dataSource name="fieldStream" type="FieldStreamDataSource"/>
    <dataSource name="db"
                type="JdbcDataSource"
                driver="oracle.jdbc.OracleDriver"
                url=""
                user=""
                password=""/>
    <document>
        <entity name="attachments"
                query="select tdata.ID as DATA_ID, null, tdata.NAME, att.DATA AS TEXT
                       from T_DATA tdata
                       left join ATTACHED_DATA att on tdata.id = att.TDATA_ID
                       where tdata.ACTIVE = 1"
                pk="DATA_ID"
                dataSource="db">
            <field column="DATA_ID" name="dataId"/>
            <field column="NAME" name="dataName"/>
   
            <entity name="attachment"
                    dataSource="fieldStream"
                    processor="TikaEntityProcessor"
                    tikaConfig="tikaconfig.xml"
                    dataField="attachments.TEXT"
                    url="TEXT"
                    format="text"
                    onError="continue">
                <field column="text" name="docBody"/>
            </entity>
        </entity>
    </document>
</dataConfig>

Now the files are stored in S3. New UUID and FILE_TYPE columns have been added to the table, and a service returns the file data as Base64 in response to a GET request keyed by UUID and FILE_TYPE. The response from the service looks like this:

{"id": "d32eba0a-ad02-4da0-bd0e-a5f2bca8b1fc","extension": "txt","file": "MTIz"}
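
To make the payload concrete: the `file` property is plain Base64, so in the sample above `MTIz` decodes to the three bytes of the text `123`. The following is a minimal, standalone Java sketch of that decoding step. It is not part of the Solr configuration; the class name `FilePayloadDecoder` and the method `decodeFile` are hypothetical, and the regex extraction assumes the flat, single-level JSON shown above (a real client would more likely use a JSON parser).

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
public class FilePayloadDecoder {
    // Matches the "file" property of the flat JSON object returned by the service.
    private static final Pattern FILE_FIELD =
            Pattern.compile("\"file\"\\s*:\\s*\"([^\"]*)\"");

    /** Returns the decoded bytes of the "file" property, or null if it is absent. */
    public static byte[] decodeFile(String json) {
        Matcher m = FILE_FIELD.matcher(json);
        if (!m.find()) {
            return null;
        }
        return Base64.getDecoder().decode(m.group(1));
    }

    public static void main(String[] args) {
        String response = "{\"id\": \"d32eba0a-ad02-4da0-bd0e-a5f2bca8b1fc\","
                + "\"extension\": \"txt\",\"file\": \"MTIz\"}";
        byte[] bytes = decodeFile(response);
        // The sample file is a text file, so it can be shown as a UTF-8 string.
        System.out.println(new String(bytes, StandardCharsets.UTF_8)); // prints 123
    }
}
```

The decoded bytes, not the JSON wrapper, are what a text extractor such as Tika would need to receive.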

I have tried several approaches and created a TikaTransformer.jar that calls this service, which I placed in opt/solr/server/lib. Here are the variations I tried.

Using a URLDataSource:

<dataSource name="s3Data"
            type="URLDataSource"
            baseUrl="http://localhost:8080"
            encoding="UTF-8"
            connectionTimeout="5000"
            readTimeout="10000"/>

Using an XPath processor with a custom transformer:

<entity name="attachment"
        processor="XPathEntityProcessor"
        url="/s3/data/file/${uuid}/${fileType}"
        forEach="/"
        transformer="script:transformer">
    <field column="file" name="base64File" xpath="/file"/>
</entity>

<script>
    function transformer(row) {
        if (row.base64File) {
            var decoded = java.util.Base64.getDecoder().decode(row.base64File);
            row.docBody = new java.lang.String(decoded, "UTF-8");
        }
        return row;
    }
</script>

And several other variations, for example:

<field column="UUID" name="uuid"/>
<field column="FILE_TYPE" name="file_type"/>
<entity name="attachment"
        dataSource="s3Data"
        processor="TikaEntityProcessor"
        tikaConfig="tikaconfig.xml"
        url="/s3/data/file/${attachments.UUID}/${attachments.FILE_TYPE}"
        format="text"
        onError="continue">
    <field column="text" name="docBody"/>
</entity>

However, I still cannot access the service to retrieve the file. Could you please guide me on where I should focus my efforts, and what I might need to fix? If necessary, I can modify the service to return the file in a different format.
