Reputation: 4696
I'm using the PHP Solr extension to interact with Apache Solr. I'm indexing data from the database, and I also want to index the contents of external files (such as PDF and PPTX).
The indexing logic is as follows. Suppose schema.xml defines these fields:
<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
<field name="created" type="tlong" indexed="true" stored="true" />
<field name="name" type="text_general" indexed="true" stored="true"/>
<field name="filepath" type="text_general" indexed="false" stored="true"/>
<field name="filecontent" type="text_general" indexed="false" stored="true"/>
A single database entry may or may not have a file attached.
Here is my code for indexing the database fields:
// $post is a stdClass object holding the database row
$doc = new SolrInputDocument();
$doc->addField('id', $post->id);
$doc->addField('name', $post->name);
....
....
$res = $client->addDocument($doc);
$client->commit();
Next, I want to add the contents of the PDF file to the same Solr document as above.
This is the cURL code:
$ch = curl_init('http://localhost:8010/solr/update/extract?');
curl_setopt ($ch, CURLOPT_POST, 1);
curl_setopt ($ch, CURLOPT_POSTFIELDS, array('myfile'=>'@'.$post->filepath));
$result= curl_exec ($ch);
But I think I'm missing something. I have read the documentation, but I cannot figure out how to extract the contents of the file and add them to the existing Solr document in the filecontent field.
EDIT #1:
If I set literal.id=xyz in the cURL request, it creates a new Solr document with id=xyz. I don't want a new Solr document created. I want the contents of the PDF to be indexed and stored as a field in the previously created Solr document.
$doc = new SolrInputDocument();//Solr document is created
$doc->addField('id', 98765);//The solr document created above is assigned an id=`98765`
....
....
$ch = curl_init('http://localhost:8010/solr/update/extract?literal.id=1&literal.name=Name&commit=true');
curl_setopt ($ch, CURLOPT_POST, 1);
curl_setopt ($ch, CURLOPT_POSTFIELDS, array('myfile'=>'@'.$post->filepath));
$result= curl_exec ($ch);
I want the above Solr document (id = 98765) to have a field in which the contents of the PDF get indexed and stored.
But the cURL request above creates another new document (with id = 1). I don't want that.
Upvotes: 3
Views: 7637
Reputation: 52809
Solr uses Apache Tika to extract the contents of rich documents and add them to the Solr document.
You may notice that although you can search on any of the text in the sample document, you may not be able to see that text when the document is retrieved. This is simply because the "content" field generated by Tika is mapped to the Solr field called "text", which is indexed but not stored. This is done via the default map rule in the /update/extract handler in solrconfig.xml and can be easily changed or overridden. For example, to store and see all metadata and content, execute the following:
The default schema.xml defines the content field as follows:
<!-- Main body of document extracted by SolrCell.
NOTE: This field is not indexed by default, since it is also copied to "text"
using copyField below. This is to save space. Use this field for returning and
highlighting document content. Use the "text" field to search the content. -->
<field name="content" type="text_general" indexed="false" stored="true" multiValued="true"/>
If you use a different field to hold the file contents, override the default mapping with fmap.content=filecontent in solrconfig.xml itself.
The fmap.content=attr_content parameter overrides the default fmap.content=text, so the content is added to the attr_content field instead.
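As a rough sketch of what that override could look like in solrconfig.xml, the fragment below sets fmap.content in the defaults of the /update/extract handler. The target field name filecontent comes from the question's schema. The uprefix line is an assumption: it only works if the schema also defines an ignored_* dynamic field, as the stock example schema does.

```xml
<requestHandler name="/update/extract" startup="lazy"
                class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <!-- Send the text extracted by Tika to the "filecontent" field -->
    <str name="fmap.content">filecontent</str>
    <str name="lowernames">true</str>
    <!-- Assumption: an ignored_* dynamicField exists to absorb unknown Tika metadata -->
    <str name="uprefix">ignored_</str>
  </lst>
</requestHandler>
```

The same mapping can also be passed per request as a URL parameter (fmap.content=filecontent) without touching the configuration.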
If you want everything indexed in a single document, pass the other field values in the extract request with the literal. prefix, e.g. literal.id=1&literal.name=Name:
$ch = curl_init('http://localhost:8010/solr/update/extract?literal.id=1&literal.name=Name&commit=true');
curl_setopt ($ch, CURLOPT_POST, 1);
curl_setopt ($ch, CURLOPT_POSTFIELDS, array('myfile'=>'@'.$post->filepath));
$result= curl_exec ($ch);
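The request above can be sketched a little more defensively as shown here. This is only an illustration under some assumptions: PHP 7 or later, the filecontent field from the question's schema, and two hypothetical helpers (buildExtractUrl and postFileToSolr) that are not part of the Solr PHP extension. It uses CURLFile because the '@' upload prefix is deprecated since PHP 5.5 and no longer works in PHP 7. Since Solr replaces a document that has the same unique key, the sketch sends every database field as a literal in the one extract request, using the document's real id rather than a fixed 1.

```php
<?php
// Sketch only: both functions are hypothetical helpers, not a Solr API.

// Build the /update/extract URL: each database field becomes a literal.* parameter,
// and fmap.content routes the text extracted by Tika into $contentField.
function buildExtractUrl(string $baseUrl, array $literals, string $contentField): string
{
    $params = [];
    foreach ($literals as $field => $value) {
        $params['literal.' . $field] = $value;
    }
    $params['fmap.content'] = $contentField;
    $params['commit'] = 'true';
    return $baseUrl . '?' . http_build_query($params, '', '&');
}

// POST the file as multipart form data; returns the response body or false on failure.
function postFileToSolr(string $url, string $filepath)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_POSTFIELDS, ['myfile' => new CURLFile($filepath)]);
    return curl_exec($ch);
}
```

A possible call, reusing the $post object from the question: $url = buildExtractUrl('http://localhost:8010/solr/update/extract', ['id' => $post->id, 'name' => $post->name], 'filecontent'); followed by $result = postFileToSolr($url, $post->filepath);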
Upvotes: 2