Previously, files were stored in the database, and indexing in Solr was done as follows:
<?xml version="1.0" encoding="UTF-8" ?>
<dataConfig>
  <dataSource name="fieldStream" type="FieldStreamDataSource"/>
  <dataSource name="db"
              type="JdbcDataSource"
              driver="oracle.jdbc.OracleDriver"
              url=""
              user=""
              password=""/>
  <document>
    <entity name="attachments"
            query="select tdata.ID as DATA_ID, null, tdata.NAME, att.DATA AS TEXT
                   from T_DATA tdata
                   left join ATTACHED_DATA att on tdata.id = att.TDATA_ID where tdata.ACTIVE = 1"
            pk="DATA_ID"
            dataSource="db">
      <field column="DATA_ID" name="dataId"/>
      <field column="NAME" name="dataName"/>
      <entity name="attachment"
              dataSource="fieldStream"
              processor="TikaEntityProcessor"
              tikaConfig="tikaconfig.xml"
              dataField="attachments.TEXT"
              url="TEXT"
              format="text"
              onError="continue">
        <field column="text" name="docBody"/>
      </entity>
    </entity>
  </document>
</dataConfig>
Now the files are stored in S3. New UUID and FILE_TYPE columns have been added to the table, and there is a service that returns the file data base64-encoded in response to a GET request containing the UUID and FILE_TYPE. The response from the service looks like this:
{"id": "d32eba0a-ad02-4da0-bd0e-a5f2bca8b1fc","extension": "txt","file": "MTIz"}
I have tried several approaches, including building a TikaTransformer.jar that calls this service, which I placed in opt/solr/server/lib. Here are the variations I tried.
Using a URLDataSource:
<dataSource name="s3Data"
            type="URLDataSource"
            baseUrl="http://localhost:8080"
            encoding="UTF-8"
            connectionTimeout="5000"
            readTimeout="10000"/>
Using an XPath processor with a custom transformer:
<entity name="attachment"
        processor="XPathEntityProcessor"
        url="/s3/data/file/${uuid}/${fileType}"
        forEach="/"
        transformer="script:transformer">
  <field column="file" name="base64File" xpath="/file"/>
</entity>
<script>
  function transformer(row) {
    if (row.base64File) {
      var decoded = java.util.Base64.getDecoder().decode(row.base64File);
      row.docBody = new java.lang.String(decoded, "UTF-8");
    }
    return row;
  }
</script>
And several others, for example:
<field column="UUID" name="uuid"/>
<field column="FILE_TYPE" name="file_type"/>
<entity name="attachment"
        dataSource="s3Data"
        processor="TikaEntityProcessor"
        tikaConfig="tikaconfig.xml"
        url="/s3/data/file/${attachments.UUID}/${attachments.FILE_TYPE}"
        format="text"
        onError="continue">
  <field column="text" name="docBody"/>
</entity>
However, I still cannot access the service to retrieve the file. Could you please guide me on where I should focus my efforts, and what I might need to fix? If necessary, I can modify the service to return the file in a different format.
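Since I can change the service, one direction I am considering is to have it return the raw file bytes instead of base64 wrapped in JSON. As far as I understand, TikaEntityProcessor reads from a binary stream, which BinURLDataSource provides, whereas URLDataSource hands back a character reader. Below is a minimal sketch under that assumption. The data source name s3Bin and the /s3/data/raw/... endpoint are hypothetical (that endpoint does not exist yet), and the parent query would also need to select the UUID and FILE_TYPE columns:

```xml
<!-- Assumption: the service is changed to return the raw file bytes.
     "s3Bin" is just an illustrative name. -->
<dataSource name="s3Bin"
            type="BinURLDataSource"
            baseUrl="http://localhost:8080"
            connectionTimeout="5000"
            readTimeout="10000"/>

<!-- Nested inside the "attachments" entity, which must select UUID and FILE_TYPE.
     "/s3/data/raw/" is a hypothetical endpoint, not the current JSON one. -->
<entity name="attachment"
        dataSource="s3Bin"
        processor="TikaEntityProcessor"
        tikaConfig="tikaconfig.xml"
        url="/s3/data/raw/${attachments.UUID}/${attachments.FILE_TYPE}"
        format="text"
        onError="continue">
  <field column="text" name="docBody"/>
</entity>
```

I have not verified this variant yet, so I would appreciate confirmation of whether it is the right direction or whether the JSON response can be consumed directly.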