Reputation: 548
I have indexed a PDF in Solr, and when I query for the text BOEHRINGER, the XML response is as follows:
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">1</int>
<lst name="params">
<str name="q">text:BOEHRINGER</str>
</lst>
</lst>
<result name="response" numFound="1" start="0">
<doc>
<str name="author">cjessen</str>
<arr name="content_type">
<str>application/pdf</str>
</arr>
<str name="id">2</str>
<date name="last_modified">2012-05-07T17:09:32Z</date>
</doc>
</result>
</response>
How do I get the contents and the file name returned as part of the XML response? What field should be added to schema.xml so that the XML response includes the text from the PDF surrounding the word I searched for (BOEHRINGER)?
Upvotes: 0
Views: 145
Reputation: 548
<!-- Solr Cell Update Request Handler
http://wiki.apache.org/solr/ExtractingRequestHandler
-->
<requestHandler name="/update/extract"
startup="lazy"
class="solr.extraction.ExtractingRequestHandler" >
<lst name="defaults">
<!-- All the main content goes into "text"... if you need to return
the extracted text or do highlighting, use a stored field. -->
<str name="fmap.content">text</str>
<str name="lowernames">true</str>
<str name="uprefix">ignored_</str>
<!-- capture link hrefs but ignore div attributes -->
<str name="captureAttr">true</str>
<str name="fmap.a">links</str>
<str name="fmap.div">ignored_</str>
</lst>
</requestHandler>
This is my solrconfig.xml file. All the fields in schema.xml have indexed="true" and stored="true". I am still trying to get the matching text, together with the words around it, in my response. For example, if "sanjay" is searched, I want the response to include "Sanjay is 6 ft tall" and also "sanjay is a good boy", assuming both sentences exist in the indexed file. The field type is defined as follows:
<fieldtype name="text" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
generateNumberParts="1" catenateWords="1"
catenateNumbers="1" catenateAll="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EnglishPorterFilterFactory"
protected="protwords.txt"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
ignoreCase="true" expand="true"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
generateNumberParts="1" catenateWords="0"
catenateNumbers="0" catenateAll="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
</fieldtype>
And the field is <field name="text" type="text_general" indexed="true" stored="true" multiValued="true"/>
Upvotes: 0
Reputation: 52809
Check the field mapping attributes.
The content of the file is usually mapped to the text field, which is not stored by default.
See the ExtractingRequestHandler configuration: the default mapping for the file contents is fmap.content=text, which can be overridden.
If you just want to see the content with the query term highlighted, you can use Solr's highlighting feature, which requires the highlighted field to be stored.
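As a rough illustration of the highlighting suggestion, the sketch below builds a select URL with Solr's standard highlighting parameters (hl, hl.fl, hl.snippets, hl.fragsize). The host, port, and path in SOLR_SELECT_URL and the helper name build_highlight_query are assumptions made for the example, not taken from the question; the field name text comes from the fmap.content mapping shown above.

```python
from urllib.parse import urlencode

# Assumed location of a local Solr instance; adjust host, port, and core path.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"


def build_highlight_query(term, field="text", snippets=2, fragsize=100):
    """Build a select URL that asks Solr for highlighted snippets.

    Hypothetical helper for illustration only. The target field must be
    stored, otherwise Solr has no text to build snippets from.
    """
    params = {
        "q": f"{field}:{term}",   # same query as in the question
        "fl": "id,author",        # regular fields to return per document
        "hl": "true",             # turn highlighting on
        "hl.fl": field,           # field to take snippets from
        "hl.snippets": snippets,  # maximum snippets per document
        "hl.fragsize": fragsize,  # approximate snippet length in characters
        "wt": "xml",              # XML response, as in the question
    }
    return f"{SOLR_SELECT_URL}?{urlencode(params)}"


print(build_highlight_query("BOEHRINGER"))
```

If highlighting works, the response should carry an extra highlighting section next to the result list, keyed by document id, containing the text fragments around each match.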
For the title of the document, you would either need to pass the title yourself when you index the document, or rely on a built-in file name field that Tika should provide as metadata.
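One way to pass the file name at index time is the ExtractingRequestHandler's literal.<fieldname> parameter, which attaches a fixed value to the document being extracted. The sketch below only builds the request URL; the PDF itself would still have to be posted to it. The base URL, the helper name build_extract_url, the sample file name report.pdf, and the field name filename are all assumptions for illustration, and a stored filename field would have to exist in schema.xml for the value to come back in results.

```python
from urllib.parse import urlencode

# Assumed location of the /update/extract handler from solrconfig.xml.
SOLR_EXTRACT_URL = "http://localhost:8983/solr/update/extract"


def build_extract_url(doc_id, file_name, filename_field="filename"):
    """Build an extract URL that stores the file name as a literal field.

    Hypothetical helper; "filename" is an assumed schema field, not one
    that appears in the question.
    """
    params = {
        "literal.id": doc_id,                      # unique key for the document
        f"literal.{filename_field}": file_name,    # file name stored alongside it
        "commit": "true",                          # make the document searchable
    }
    return f"{SOLR_EXTRACT_URL}?{urlencode(params)}"


print(build_extract_url("2", "report.pdf"))
```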
Upvotes: 1