Massimo Modica

Reputation: 47

Add dynamic field based on file path

I'm using DIH and Tika to index documents in different languages.

There's a folder for each language (e.g. /de/file001.pdf), and I want to extract the language from the path and then dynamically add the language-specific Solr field (e.g. text_de).

Here's my attempted solution:

<dataConfig>
  <script><![CDATA[
    function addField(row) {
      row.put('text_' + row.get('lang'), row.get('text'));
      return row;
    }
  ]]></script>
  <dataSource type="BinFileDataSource" />
    <document>
      <entity name="files" dataSource="null" rootEntity="false"
          processor="FileListEntityProcessor"
          baseDir="/tmp/documents" fileName=".*\.(doc|pdf|docx)"
          onError="skip"
          recursive="true"
          transformer="RegexTransformer" query="select * from files">

        <field column="fileAbsolutePath" name="id" />
        <field column="lang" regex=".*/(\w*)/.*" sourceColName="fileAbsolutePath"/>

        <entity name="documentImport"
            processor="TikaEntityProcessor"
            url="${files.fileAbsolutePath}"
            format="text"
            transformer="script:addField">

          <field column="date" name="date" meta="true"/>
          <field column="title" name="title" meta="true"/>
        </entity>

      </entity>
    </document>
</dataConfig>

This doesn't work because row contains the 'text' field but not the 'lang' field.

Upvotes: 2

Views: 523

Answers (1)

Lodato L

Reputation: 146

Your approach is correct. The problem is that row only contains the fields of the current entity (documentImport), so lang, which is set on the parent files entity, is not in it.

To access the parent row, use the context variable that is passed as the second argument to the script function. The Context is implemented by ContextImpl, and on each invocation the Solr ScriptTransformer passes the same Context instance as the second parameter (see transformRow).

The following script reads the field value from the parent row and should address your problem:

<dataConfig>
  <script><![CDATA[
    function addField(row, context) {
      // 'lang' is set on the parent 'files' entity, so resolve it via the parent context
      var lang = context.getParentContext().resolve('files.lang');
      row.put('text_' + lang, row.get('text'));
      return row;
    }
  ]]></script>
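
If you want to check the function's logic before running a full import, here is a minimal sketch that runs it outside Solr in a plain JavaScript engine. makeRow and makeContext are hypothetical stand-ins I made up for the Java Map and the DIH Context; they only mimic the get/put and getParentContext/resolve calls used above, so this verifies the script logic, not the behaviour of your Solr version.

```javascript
// Transformer as it would appear inside <script> in the DIH config.
function addField(row, context) {
  var lang = context.getParentContext().resolve('files.lang');
  row.put('text_' + lang, row.get('text'));
  return row;
}

// Hypothetical stub for the row (a java.util.Map inside Solr).
function makeRow(values) {
  return {
    values: values,
    get: function (key) { return this.values[key]; },
    put: function (key, value) { this.values[key] = value; }
  };
}

// Hypothetical stub for the DIH Context; only the two calls used above exist.
function makeContext(parentVariables) {
  return {
    getParentContext: function () {
      return {
        resolve: function (name) { return parentVariables[name]; }
      };
    }
  };
}

// Simulate a document found under /de/ whose parent entity resolved lang to 'de'.
var row = addField(
  makeRow({ text: 'Hallo Welt' }),
  makeContext({ 'files.lang': 'de' })
);
console.log(JSON.stringify(row.values));
```

In the stubbed run the row ends up with a text_de key holding the same value as text. In Solr you still need a matching field or dynamic field (e.g. text_*) in the schema for the new column to be indexed.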

Upvotes: 1
