Reputation: 1431
I have a small set of descriptive metadata records (~50), each with a corresponding full-text file (.txt). My understanding is that the Apache Tika framework is used for detecting and extracting metadata and structured text from various types of documents. However, I also need to implement a linkage mechanism whereby each metadata record is matched to its full text. Can this be done in Solr?
Thanks,
Ilaria
Upvotes: 1
Views: 2546
Reputation: 9049
If you have both the metadata and the document content, you can index the metadata and store the content. Your field definitions would look something like this:
<field name="filename" type="text" indexed="true" stored="true"/>
... <!-- other metadata fields -->
<field name="content" type="text" indexed="false" stored="true"/>
This will allow you to search by any metadata, and give you back the content. You can add as much meta information as required to search the text. I wouldn't index the full text as there is already some structured metadata available.
Apache Tika extracts text and metadata from formats such as HTML pages, PDFs and office documents. Since you already have the metadata available, you need not use Tika. Besides, your content is already plain text, so there is nothing embedded for Tika to extract.
Edit 1:
OK, so the link between the metadata and the content will be maintained in Solr. For example, if you have
File1.txt <-> Metadata1.txt
you could have one record (document) in Solr that holds all the metadata fields plus one plaintextcontent field. This gives you the flexibility to look up the document by any metadata. For example,
q=filename:File1.txt
or
q=filesize:[1 TO 100]
where filename and filesize are example metadata fields, and plaintextcontent would hold your text file content. Because the metadata and the text live in the same Solr document, the document itself is your link.
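To make the one-document-per-file idea concrete, here is a minimal sketch of such a record in Solr's XML update format. The field names follow the examples above; the values are made up for illustration, and the fields are assumed to exist in your schema.

```xml
<add>
  <doc>
    <!-- metadata fields (example names; values are illustrative only) -->
    <field name="filename">File1.txt</field>
    <field name="filesize">42</field>
    <!-- the full text of the file, kept in the same document as its metadata -->
    <field name="plaintextcontent">Full text of File1.txt goes here.</field>
  </doc>
</add>
```

Posting something like this to the update handler would create a single document that a query on filename or filesize can find, returning the text along with it.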
Now the trick is to set up the indexing. Here's one way to do it:
Indexing the text file is very simple. You could use the DataImportHandler's PlainTextEntityProcessor.
Indexing the metadata along with it could be slightly trickier, since it depends on how your metadata files are structured. You could use LineEntityProcessor or one of DataImportHandler's Transformers, depending on what suits you best.
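As a rough starting point, a DataImportHandler data-config.xml for the text-file half might look like the sketch below. It assumes the files sit in one directory (the baseDir path is a placeholder), the entity names are arbitrary, and the target field names should be changed to match your schema. It only picks up the file name and size that FileListEntityProcessor provides; your own metadata files would still need an extra entity or transformer, which is not shown here.

```xml
<dataConfig>
  <!-- reads each file as character data for PlainTextEntityProcessor -->
  <dataSource type="FileDataSource" name="files" encoding="UTF-8"/>
  <document>
    <!-- outer entity: lists the .txt files; baseDir is a placeholder path -->
    <entity name="f" processor="FileListEntityProcessor"
            baseDir="/path/to/textfiles" fileName=".*\.txt"
            rootEntity="false" dataSource="null">
      <field column="file" name="filename"/>
      <field column="fileSize" name="filesize"/>
      <!-- inner entity: one Solr document per file, with the whole text in one field -->
      <entity name="text" processor="PlainTextEntityProcessor"
              url="${f.fileAbsolutePath}" dataSource="files">
        <field column="plainText" name="plaintextcontent"/>
      </entity>
    </entity>
  </document>
</dataConfig>
```

Setting rootEntity="false" on the outer entity makes each file, rather than the directory listing as a whole, produce one Solr document.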
Upvotes: 3