Reputation: 8322
I am able to upload PDF files into Solr and search them. But what is indexing in Solr? When I upload a PDF file, how does Solr index it?
This is the code I use to upload the PDF file:
ContentStreamUpdateRequest up = new ContentStreamUpdateRequest("/update/extract");
up.addFile(fileName);
up.setParam("literal.id", solrId);
up.setParam("literal.first_name", "apachesolr");
up.setParam("literal.last_name", "cookbook");
up.setParam("literal.age", "30");
up.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
solrServer.request(up);
Below is my schema.xml:
<field name="first_name" type="string" indexed="true" stored="true" required="true"/>
<field name="last_name" type="string" indexed="true" stored="true" required="true"/>
<field name="age" type="int" indexed="true" stored="true" required="true"/>
<field name="created_at" type="date" indexed="true" stored="true"/>
<field name="updated_at" type="date" indexed="true" stored="true"/>
<field name="id" type="string" indexed="true" stored="true" required="true"/>
When I search for any content from the PDF, the result looks like this:
SolrDocument[{
last_modified=Fri Oct 17 08:17:38 IST 2003,
author=Mark Roth, Eduardo Pelegri-Llopart,
title=[JSP 2.0 Specification, Final Release],
content_type=[application/pdf],
keywords=JSP,
age=30,
last_name=cookbook,
first_name=apachesolr,
id=jsp-2_0-fr-spec.pdf
}]
How is Solr able to get the title, author, keywords, etc.?
Upvotes: 3
Views: 2355
Reputation: 28552
You misunderstand the concept of a document in search engines. A document is a set of named fields with corresponding values. You should always set each field explicitly. To start with, try the following code with SolrJ:
CommonsHttpSolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");
for (int i = 0; i < 1000; ++i) {
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("title", "My Favorite book");
    doc.addField("author", "Kevin");
    doc.addField("content", "Bla bla bla");
    solr.add(doc);
}
solr.commit();
This code creates a new SolrInputDocument and adds 3 fields, "title", "author" and "content" (note: all of these fields must be defined in schema.xml so that Solr knows how to index and store them). It then adds the new doc to the pending changes (solr.add(doc)) and finally commits them. This is the basic way to work with Solr.
In this normal flow you extract the text from documents yourself. For example, you can use Tika for this purpose. This is the most flexible and fine-grained approach.
What you are trying to do is use a newer Solr feature, content extraction. If I understand correctly, you are trying to set fields with setParam(). Note that setParam() only sets request parameters, which are translated into URL params that tell Solr how to handle the request itself. The /update/extract handler does recognize params of the form literal.<fieldname> and adds them to the document as field values, which is why first_name, last_name and age appear in your result. Everything else is extracted: the handler detects the file's MIME type, extracts the contents, finds hints about document attributes (title, author, keywords and so on) and uses them as fields (note that Solr uses the Tika library to extract document contents). So, if you really want to use the /update/extract handler, try to follow this example without altering the lines corresponding to request params and check which fields were generated.
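To make the request-parameter part more concrete, here is a minimal sketch in plain Java (no Solr classes). It shows roughly what the params from the question look like once they are turned into a URL query string, and how the "literal." prefix marks the ones that carry field values, which matches the age, last_name and first_name entries in the result above. The class and method names are hypothetical, the single "commit" param is a simplified stand-in for the commit action, and this is only a model of the convention, not Solr's actual code.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: models request parameters, not Solr's internal code.
class LiteralParamsSketch {

    static final String LITERAL_PREFIX = "literal.";

    // Encode the parameters as a URL query string, keeping insertion order.
    static String toQueryString(Map<String, String> params) {
        StringBuilder sb = new StringBuilder();
        try {
            for (Map.Entry<String, String> e : params.entrySet()) {
                if (sb.length() > 0) {
                    sb.append('&');
                }
                sb.append(URLEncoder.encode(e.getKey(), "UTF-8"))
                  .append('=')
                  .append(URLEncoder.encode(e.getValue(), "UTF-8"));
            }
        } catch (UnsupportedEncodingException ex) {
            // UTF-8 is always available on a JVM, so this should not happen.
            throw new IllegalStateException(ex);
        }
        return sb.toString();
    }

    // Pick out the parameters that carry field values, with the prefix stripped.
    static Map<String, String> literalFields(Map<String, String> params) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : params.entrySet()) {
            if (e.getKey().startsWith(LITERAL_PREFIX)) {
                fields.put(e.getKey().substring(LITERAL_PREFIX.length()), e.getValue());
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("literal.id", "jsp-2_0-fr-spec.pdf");
        params.put("literal.first_name", "apachesolr");
        params.put("literal.last_name", "cookbook");
        params.put("literal.age", "30");
        params.put("commit", "true"); // simplified stand-in for the commit action

        System.out.println("/update/extract?" + toQueryString(params));
        System.out.println(literalFields(params));
    }
}
```

Params without the prefix (such as "commit" here) only control how the request is handled; they never end up as fields of the document.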
Upvotes: 4