Reputation: 389
I would like to index multiple PDF files under the same Solr ID. In one of our projects, we have objects represented like this:
{"id" : "object:1234",
"authors" : ["me", "you", ...],
"keywords": ["key1", "key3", ...],
"files" : [
"/tmp/file1.pdf",
"/tmp/file2.pdf",
"/tmp/file3.pdf"
]
}
We created a first process that quickly indexes the basic metadata (all fields except 'files') into our Solr 6 server. Now we need a second process that indexes the content of all the files into Solr under the same ID.
The first process creates this Solr document (this process already works):
{"id":"object:1234",
"keywords":["key1", "key2"],
"authors": ["me", "you"],
"last_modified":"2017-09-04T12:00:00.000Z",
"_version_":1577256778756784128
}
At the end of the second process, I would like my Solr document to look like this:
{"id":"object:1234",
"keywords":["key1", "key2"],
"authors": ["me", "you"],
"last_modified":"2017-09-04T13:00:00.000Z",
"content":["content_of_file1", "content_of_file2", ...],
"files":["/tmp/file1.pdf", "/tmp/file2.pdf", ...],
"_version_":1577256778756784129
}
Is there an easy way to do that using Solr handlers?
So far, the only solution I have found is a Python script that calls Tika to extract each file's content and then uses a Solr atomic update ("Updating Parts of Documents") to complete the Solr document. But this solution is not very elegant, and it doesn't work well with large files.
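To make the second step concrete, here is a minimal sketch of the JSON body such a partial update could send to the `/update` handler. The helper name `build_atomic_update` is hypothetical, and the sketch assumes the `files` and `content` fields are multivalued and that the existing fields are stored (or have docValues), which atomic updates require.

```python
import json


def build_atomic_update(doc_id, files, contents):
    """Build a Solr atomic-update body (hypothetical helper).

    The "set" modifier replaces the field's values on the existing
    document while leaving the other stored fields untouched.
    """
    if len(files) != len(contents):
        raise ValueError("files and contents must have the same length")
    return [
        {
            "id": doc_id,
            "files": {"set": list(files)},
            "content": {"set": list(contents)},
        }
    ]


payload = build_atomic_update(
    "object:1234",
    ["/tmp/file1.pdf", "/tmp/file2.pdf"],
    ["content_of_file1", "content_of_file2"],
)
# This JSON would be POSTed to <core>/update with Content-Type: application/json.
body = json.dumps(payload)
```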
Do you know a better solution to my problem?
Many thanks for your help.
Upvotes: 0
Views: 119
Reputation: 52792
I'm fairly sure you have to do exactly what you've done: call Solr's Tika with extractOnly=true (or use Tika directly to get the data you need), then merge the content yourself and submit it as a single document to Solr. There is no built-in support for merging the content extracted from multiple files into a set of multivalued fields.
However, I'd do everything in a single request instead of sending a separate update for each file you extract:
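# pseudo code
document = {"files": [], "content": []}
for file in files:
    # record the path and the text extracted from it at the same index
    document["files"].append(file.name)
    tika = solr.tika(read(file.name), extractOnly=True)
    document["content"].append(tika["content"])
# one add and one commit for the whole merged document
solr.add(document)
solr.commit()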
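As a runnable variant of the pseudo code above, the merge step can be sketched in plain Python. The extraction itself is passed in as a callable, so `extract_text` is an assumption standing in for either a call to Solr's extract handler with extractOnly=true or a direct Tika call; `build_merged_document` and `fake_extract` are hypothetical names used only for illustration.

```python
import os


def build_merged_document(doc_id, paths, extract_text):
    """Collect every file path and its extracted text into one document.

    extract_text is any callable mapping a file path to its text; in
    practice it would wrap Tika or Solr's extractOnly=true request.
    """
    document = {"id": doc_id, "files": [], "content": []}
    for path in paths:
        document["files"].append(path)
        document["content"].append(extract_text(path))
    return document


def fake_extract(path):
    # Stand-in extractor for the example: no Tika or Solr call is made.
    name = os.path.splitext(os.path.basename(path))[0]
    return "content_of_" + name


document = build_merged_document(
    "object:1234",
    ["/tmp/file1.pdf", "/tmp/file2.pdf", "/tmp/file3.pdf"],
    fake_extract,
)
```

Keep in mind that adding a document whose id already exists replaces the earlier document, so the merged document either needs to carry the metadata fields as well, or the two lists have to be sent as an atomic update.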
Upvotes: 1