panza

Reputation: 1431

Indexing full-text and descriptive metadata in Solr

I have a small set of descriptive metadata records (~50), and for each of them a corresponding full-text file (.txt). My understanding is that the Apache Tika framework is used for detecting and extracting metadata and structured text from various types of documents. However, I also need a linkage mechanism whereby each metadata record is matched to its full text. Can this be done in Solr?

Thanks,

Ilaria

Upvotes: 1

Views: 2546

Answers (1)

Srikanth Venugopalan

Reputation: 9049

If you have both the metadata and the document content, you can index the metadata and store the content. Your field definitions would look something like this:

<field name="filename" type="text" indexed="true" stored="true"/>
<!-- ... other metadata fields ... -->
<field name="content" type="text" indexed="false" stored="true"/>

This lets you search on any metadata field and get the content back in the results. You can add as many metadata fields as you need for searching. I wouldn't index the full text, since structured metadata is already available.

Apache Tika extracts text and metadata from rich formats such as HTML pages, PDFs, and office documents. Since you already have the metadata and your text is plain, you don't need Tika here: plain-text files carry no embedded metadata for it to extract.

Edit 1:

OK, so the link between the metadata and the content will be maintained in Solr. For example, if you have

File1.txt <-> Metadata1.txt

you could have one record (document) in Solr that holds one field per metadata item plus one plaintextcontent field. This gives you the flexibility to look up the document by any metadata field. For example,

q=filename:File1.txt

or

q=filesize:[1 TO 100]

where filename and filesize are example metadata fields, and plaintextcontent holds the content of your text file. Because both live in the same Solr document, the schema itself gives you the link.
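As a rough illustration of the two lookups above, the request URLs could be assembled like this in Python. The host, port, and core name (localhost:8983, docs) and the fl field list are assumptions for the sake of the example; the q values mirror the queries above.

```python
from urllib.parse import urlencode

# Hypothetical Solr core URL; adjust host, port, and core name to your setup.
SOLR_SELECT = "http://localhost:8983/solr/docs/select"


def select_url(query, fields="filename,plaintextcontent"):
    """Build a select URL that searches metadata and returns the stored text."""
    params = {"q": query, "fl": fields, "wt": "json"}
    # urlencode percent-escapes ':', '[' and ']' and turns spaces into '+'.
    return SOLR_SELECT + "?" + urlencode(params)


by_name = select_url("filename:File1.txt")
by_size = select_url("filesize:[1 TO 100]")  # range queries need uppercase TO

print(by_name)
print(by_size)
```

Each response document would then carry the stored plaintextcontent next to the metadata field that matched.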

Now the trick is to set up the indexing. Here's one way to do it:

Indexing the text file is very simple. You could use the DataImportHandler's PlainTextEntityProcessor.

Indexing the metadata along with it could be slightly trickier, since it depends on how your metadata is structured. You could use LineEntityProcessor or one of the DataImportHandler's Transformers, whichever suits you best.
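If the DataImportHandler turns out to be awkward for your metadata layout, an alternative (not the DIH route described above) is to build the combined documents in a small script and send them to Solr's JSON update handler. Here is a minimal sketch, assuming each metadata file holds simple "key: value" lines. That layout is hypothetical, since the real structure isn't given, and the strings below are stand-ins for the file contents.

```python
import json


def parse_metadata(text):
    """Parse 'key: value' lines into a dict (assumed metadata layout)."""
    fields = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # skip blank or malformed lines
        key, value = line.split(":", 1)
        fields[key.strip()] = value.strip()
    return fields


def build_doc(metadata_text, content_text):
    """Merge one metadata record and its full text into one Solr document."""
    doc = parse_metadata(metadata_text)
    doc["plaintextcontent"] = content_text
    return doc


# Stand-ins for the contents of Metadata1.txt and File1.txt.
metadata1 = "filename: File1.txt\nfilesize: 42\n"
file1 = "Full text of the first document."

docs = [build_doc(metadata1, file1)]
payload = json.dumps(docs)  # JSON array of documents for an update request
print(payload)
```

With only ~50 records, looping over the file pairs and posting one payload like this to the core's update handler would probably be enough. Note that values parsed this way are strings, so a numeric field such as filesize would rely on the schema's field type for conversion.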

Upvotes: 3
