Renaud Michotte

Reputation: 389

Indexing multiple binary files into unique solrDocument

I would like to index multiple PDF files under the same Solr ID. For one of our projects, we have objects represented like this:

{"id"      : "object:1234",
 "authors" : ["me", "you", ...],
 "keywords": ["key1", "key3", ...],
 "files"   : [
   "/tmp/file1.pdf",
   "/tmp/file2.pdf",
   "/tmp/file3.pdf"
 ]
}

We created a first process to quickly index the basic metadata (all fields except 'files') into our Solr 6 server. Now we need a second process to index the content of all the files into Solr under the same ID.

So the first process creates this Solr document (this process already works):

{"id":"object:1234",
 "keywords":["key1", "key2"],
 "authors": ["me", "you"],
 "last_modified":"2017-09-04T12:00:00.000Z",
 "_version_":1577256778756784128
}

And at the end of my second process, I would like my Solr document to look like this:

{"id":"object:1234",
 "keywords":["key1", "key2"],
 "authors": ["me", "you"],
 "last_modified":"2017-09-04T13:00:00.000Z",
 "content":["content_of_file1", "content_of_file2", ...],
 "files":["/tmp/file1.pdf", "/tmp/file2.pdf", ...],
 "_version_":1577256778756784129
}

Is there an easy way to do that using Solr handlers?
At the moment, the only solution I have found is to write a Python script that calls Tika to extract the file content and then uses a Solr "parts of document" (atomic) update to complete my Solr document. But this solution is not very elegant... and doesn't work well with large files.
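For reference, the atomic update mentioned above would look roughly like this (field names taken from the documents shown above; the `set` modifier replaces the whole field value on the existing document):

```json
{"id"     : "object:1234",
 "files"  : {"set": ["/tmp/file1.pdf", "/tmp/file2.pdf"]},
 "content": {"set": ["content_of_file1", "content_of_file2"]}
}
```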

Do you know a better way to solve my problem?
Many thanks for your help.

Upvotes: 0

Views: 119

Answers (1)

MatsLindh

Reputation: 52792

I'm fairly sure you have to do exactly what you've done - call Solr's extracting handler (Tika) with extractOnly=true (or use Tika directly to get the data you need), then merge the content yourself and submit it as a single document to Solr. There is no built-in support for merging multiple extracted files into a set of multivalued fields.

However, I'd do everything in a single update request instead of issuing a separate update for each file you extract content from:

# pseudo code
document = {"files": [], "content": []}

for file in files:
    document["files"].append(file.name)

    # extractOnly=true makes Solr return the extracted text
    # instead of indexing it
    tika = solr.tika(extractOnly=True, body=read(file.name))
    document["content"].append(tika["content"])

solr.add(document)
solr.commit()
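Concretely, the loop above can be sketched with nothing but the standard library. The core URL, the `content`/`files` field names, and the use of an atomic `set` update (so the metadata indexed by the first process survives) are my assumptions, not something the question confirms:

```python
import json
import urllib.request

SOLR = "http://localhost:8983/solr/mycore"  # assumed core URL

def pick_extracted(body):
    # With extractOnly=true, Solr keys the extracted text by the
    # content-stream name; take the first string value that is not
    # the response header.
    return next(v for k, v in body.items()
                if k != "responseHeader" and isinstance(v, str))

def extract_text(path):
    # POST the raw file to the ExtractingRequestHandler without indexing it.
    url = (SOLR + "/update/extract"
           "?extractOnly=true&extractFormat=text&wt=json")
    with open(path, "rb") as fh:
        req = urllib.request.Request(
            url, data=fh.read(),
            headers={"Content-Type": "application/pdf"})
    with urllib.request.urlopen(req) as resp:
        return pick_extracted(json.load(resp))

def index_files(doc_id, paths):
    # One atomic update: 'set' replaces only these two fields and leaves
    # the rest of the document (keywords, authors, ...) untouched.
    doc = {"id": doc_id,
           "files": {"set": paths},
           "content": {"set": [extract_text(p) for p in paths]}}
    req = urllib.request.Request(
        SOLR + "/update?commit=true",
        data=json.dumps([doc]).encode("utf-8"),
        headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req).close()
```

Note that atomic updates only preserve the other fields if they are stored (or docValues) in your schema; otherwise you would have to resend the full document including the metadata.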

Upvotes: 1
