How to store information in Solr?

I recently started learning Solr, and some things are still unclear to me. I'll explain what I'm trying to do; please tell me which way to go.

I need a web application that saves data in which some fields are plain text and some are files. I understand how to add the text fields, but I have not managed to add files, or their contents as text. In that case, where should I store the file itself?

If I need to find a file and only know a couple of words from it, I want every file that contains those words to be returned. Should I add a separate database for this? If so, where should the files be stored? If not, the same question applies.

It would help me a lot to see this in an example; maybe you have a link?

Upvotes: 0

Views: 302

Answers (2)

Abhijit Bashetti

Reputation: 8668

MatsLindh has already described an approach to achieve what you are looking for.

Here are the steps to index files from a known location.

Add the following lines to solrconfig.xml:

<!-- Load Data Import Handler and Apache Tika (extraction) libraries -->
    <lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-dataimporthandler-.*\.jar"/>
    <lib dir="${solr.install.dir:../../../..}/contrib/extraction/lib" regex=".*\.jar"/>

  <requestHandler name="/dataimport" class="solr.DataImportHandler">
    <lst name="defaults">
      <str name="config">tika-data-config.xml</str>
    </lst>
  </requestHandler>

Create a file named tika-data-config.xml in the G:\Solr\TikaConf\conf folder with the configuration below. This location may be different for you.

<dataConfig>
  <dataSource type="BinFileDataSource"/>
  <document>
    <entity name="file" processor="FileListEntityProcessor" dataSource="null"
            baseDir="G:/Solr/solr-7.7.2/example/exampledocs" fileName=".*xml"
            rootEntity="false">

      <field column="file" name="id"/>

      <entity name="pdf" processor="TikaEntityProcessor"
              url="${file.fileAbsolutePath}" format="text">

        <field column="text" name="text"/>

      </entity>
    </entity>
  </document>
</dataConfig>

Add the following field to your schema.xml:

<field name="text" type="text_general" indexed="true" stored="true" multiValued="false"/>

Update solrconfig.xml as shown below to disable schemaless mode:

<!-- The update.autoCreateFields property can be turned to false to disable schemaless mode -->
  <updateRequestProcessorChain name="add-unknown-fields-to-the-schema" default="${update.autoCreateFields:false}"
           processor="uuid,remove-blank,field-name-mutating,parse-boolean,parse-long,parse-double,parse-date,add-schema-fields">
    <processor class="solr.LogUpdateProcessorFactory"/>
    <processor class="solr.DistributedUpdateProcessorFactory"/>
    <processor class="solr.RunUpdateProcessorFactory"/>
  </updateRequestProcessorChain>

Go to the Solr admin page, select the core you created, and click on Data Import.

[Screenshot: Solr Data Import]

Once the data has been imported and indexed, you can verify it by querying the core.

[Screenshot: Solr Query Page]

If your file location is dynamic, meaning you retrieve it from a database, then the first entity would be the one that reads the files' metadata (id, name, author, file path, etc.) from the database. In the second entity, the TikaEntityProcessor, pass in the file path to get the content of the file indexed.
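A minimal sketch of that two-entity setup, in case it helps. Everything specific here is an assumption for illustration: the PostgreSQL driver and connection URL, the credentials, the files table and its id, name, author and path columns, and the name and author fields in the schema. The JDBC driver jar would also have to be on Solr's classpath.

```xml
<dataConfig>
  <!-- Hypothetical JDBC connection; driver, url, user and password are placeholders -->
  <dataSource name="db" type="JdbcDataSource"
              driver="org.postgresql.Driver"
              url="jdbc:postgresql://localhost:5432/appdb"
              user="solr" password="secret"/>
  <!-- Binary source used by Tika to read the files from disk -->
  <dataSource name="bin" type="BinFileDataSource"/>
  <document>
    <!-- Outer entity: one row per file, read from a hypothetical "files" table -->
    <entity name="filemeta" dataSource="db"
            query="SELECT id, name, author, path FROM files">
      <field column="id" name="id"/>
      <field column="name" name="name"/>
      <field column="author" name="author"/>

      <!-- Inner entity: Tika extracts the text of the file at the path of the current row -->
      <entity name="content" processor="TikaEntityProcessor" dataSource="bin"
              url="${filemeta.path}" format="text">
        <field column="text" name="text"/>
      </entity>
    </entity>
  </document>
</dataConfig>
```

Depending on the database, the column names in the result set may come back in a different case, so the ${filemeta.path} reference may need adjusting.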

Upvotes: 0

MatsLindh

Reputation: 52892

This is far too broad and non-specific for an answer you can just implement; in general, you'd submit the documents to Solr together with an id (through Tika in the Extracting Request Handler / Solr Cell).
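As a rough illustration, the Extracting Request Handler is usually registered in solrconfig.xml along these lines. This is a sketch assuming a Solr 7.x layout where Solr Cell ships as a contrib, a schema with a text field for the extracted content, and an ignored_* dynamic field to swallow Tika metadata you don't care about; adjust the paths and field names to your setup.

```xml
<!-- Load the Solr Cell (extraction) libraries; paths assume the stock Solr 7.x directory layout -->
<lib dir="${solr.install.dir:../../../..}/contrib/extraction/lib" regex=".*\.jar"/>
<lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-cell-\d.*\.jar"/>

<requestHandler name="/update/extract" class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <!-- Normalize Tika metadata names to lowercase with underscores -->
    <str name="lowernames">true</str>
    <!-- Map the extracted body ("content") to the "text" field (assumed to exist in the schema) -->
    <str name="fmap.content">text</str>
    <!-- Prefix fields not in the schema so they match an assumed ignored_* dynamic field -->
    <str name="uprefix">ignored_</str>
  </lst>
</requestHandler>
```

The id and any other metadata are then passed as literal.<fieldname> request parameters (for example literal.id) when the file is posted to /update/extract.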

The documents themselves will have to be stored somewhere else, as Solr doesn't handle document storage for you. They can be stored on a cloud service, on a network drive or on a local disk - this will depend on your web application.

Your application will then receive the file from the user, store a database row assigning the file to the user, store the file somewhere (S3/GoogleCloudStorage/Local path) under a well-known name (usually the id of the row from the database) and submit the content to Solr for indexing - together with metadata (such as the user id) and the file id.

Searching will then give you the id back and you can retrieve the document from wherever you stored it.
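For the metadata and file id to be searchable and returned, the schema needs matching fields. A possible sketch, where user_id is a hypothetical field name for the owner of the file and text is assumed to hold the extracted content:

```xml
<!-- File id: the same id the application uses as the name of the stored file -->
<field name="id" type="string" indexed="true" stored="true" required="true"/>
<!-- Hypothetical metadata field: owner of the file, usable as a filter query -->
<field name="user_id" type="string" indexed="true" stored="true"/>
<!-- Extracted file content: searchable; set stored="true" only if you want highlighting or snippets -->
<field name="text" type="text_general" indexed="true" stored="false"/>
```

With fields like these, a query on text restricted by user_id returns the matching id values, which the application then uses to fetch the files from its own storage.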

Upvotes: 1
