joadha

Reputation: 191

How does one instruct the ExtractingRequestHandler to parse only the body of a document?

How can I instruct the extracting request handler to ignore metadata/headers etc. when it constructs the "content" of the document I send to it?

For example, I created an MS Word document containing just the word "SEARCHWORD" and nothing else. However, when I send this document to my Solr index, its contents are mapped to my "body" field as follows:

<str name="body">
    Last-Printed 2009-02-05T15:02:00Z Revision-Number 22 Comments stream_source_info 
    myfile Last-Author Inigo Montoya Template Normal.dotm Page-Count 1 subject Application-Name
     Microsoft Macintosh Word Author Jesus Baggins Word-Count 2 xmpTPg:NPages 1 Edit-Time 
    108600000000 Creation-Date 2008-11-05T20:19:00Z stream_content_type application/octet-stream 
    Character Count 14 stream_size 31232 stream_name /Applications/MAMP/tmp/php/phpHCIg7y 
    Some Company Content-Type application/msword Keywords Last-Save-Date 
    2012-05-01T18:55:00Z SEARCHWORD
</str>

All I want is the body of the document, in this case the word "SEARCHWORD."

For further reference, here's my extraction handler:

 <requestHandler name="/update/extract" 
                 startup="lazy"
                 class="solr.extraction.ExtractingRequestHandler" >
    <lst name="defaults">
      <!-- All the main content goes into "body"... if you need to return
           the extracted text or do highlighting, use a stored field. -->
      <str name="fmap.content">body</str>
      <str name="lowernames">true</str>
      <str name="uprefix">ignored_</str>
    </lst>
  </requestHandler>

Upvotes: 2

Views: 1375

Answers (2)

joadha

Reputation: 191

The awesome people on the solr-user mailing list got to the bottom of this. It turns out the field name "meta" is a special case: the ExtractingRequestHandler copies all metadata to this field. In my case, I was getting the contents, too, because of the fmap.content mapping in my own handler configuration. I renamed my "meta" field to something else, and now it receives only the contents of the document.

This behavior is not currently documented in the Solr wiki. I hope this helps someone else who may have a field named "meta" in their schema to which they're extracting document contents (unlikely, I know).

Upvotes: 4

Marko Bonaci

Reputation: 5706

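Have you tried adding the xpath parameter to the defaults? The parameter name is lowercase, and the expression has to start at the root of the XHTML that Tika produces:

<str name="xpath">/xhtml:html/xhtml:body/descendant::node()</str>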

You can quickly test it by passing the parameter in the URL, as the link above shows.
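
As a rough sketch of what such a test URL could look like, the snippet below builds a request URL that combines the handler's xpath parameter with extractOnly=true, so the extracted text is returned rather than indexed. The host, port, and handler path are assumptions about a local setup, build_extract_url is a hypothetical helper, and the expression assumes Tika's XHTML output is rooted at xhtml:html.

```python
from urllib.parse import urlencode

# Assumed endpoint: a local Solr with the handler registered at /update/extract.
SOLR_EXTRACT_URL = "http://localhost:8983/solr/update/extract"


def build_extract_url(xpath, base_url=SOLR_EXTRACT_URL):
    """Build a URL that asks the handler to extract (not index) using an XPath filter."""
    params = {
        "extractOnly": "true",  # return the extracted content instead of indexing it
        "xpath": xpath,         # keep only the XHTML nodes matching this expression
    }
    # urlencode percent-escapes the slashes and colons in the expression.
    return base_url + "?" + urlencode(params)


url = build_extract_url("/xhtml:html/xhtml:body/descendant::node()")
print(url)
```

The document itself would then be POSTed to that URL as a file upload; if the response still contains the metadata names and values, the expression is not narrowing the output to the body.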

Upvotes: 1
