Jake

Reputation: 4670

ElasticSearch: Highlighting with Stemming

I have read this question and attempted to understand the documentation here, but this is complicated.

The problem (I think):

[update 1]

My code is in Scala and talks to Elasticsearch through the Java High Level REST Client.

I have a stemming analyzer configured. If I search for responsibilities, I get results for both responsibilities and responsibility. That's great.

BUT

Only the documents containing the term responsibilities return highlights. I think this is because the search runs against the stemmed content (i.e., responsib), whereas the highlighter runs against the unstemmed content. Hence it finds responsibilities, which was a search term, but not responsibility, which wasn't.

If I set the highlighter to highlight the stemmed content, it returns nothing at all, I guess because it is comparing responsib with responsibilities.

Search

I am using the Java high-level API. The problem is not the code itself. Currently I highlight only the content field, which returns highlights only for responsibilities. Highlighting content.english seems to return nothing.

  private def buildHighlighter(): HighlightBuilder = {
    import org.elasticsearch.search.fetch.subphase.highlight.HighlightBuilder
    val highlightBuilder = new HighlightBuilder
    val highlightContent = new HighlightBuilder.Field("content")
    highlightContent.highlighterType("unified")
    highlightBuilder.field(highlightContent)
    highlightBuilder
  }

Mapping (abridged)

{
	"settings": {
		"number_of_shards": 3,
		"analysis": {
			"filter": {
				"english_stop": {
					"type": "stop",
					"stopwords": "_english_"
				},
				"english_keywords": {
					"type": "keyword_marker",
					"keywords": []
				},
				"english_stemmer": {
					"type": "stemmer",
					"language": "english"
				},
				"english_possessive_stemmer": {
					"type": "stemmer",
					"language": "possessive_english"
				}
			},
			"analyzer": {
				"english": {
					"tokenizer": "standard",
					"filter": [
						"english_possessive_stemmer",
						"lowercase",
						"english_stop",
						"english_keywords",
						"english_stemmer"
					]
				}
			}
		}
	},
	"mappings": {
		"_doc": {
			"properties": {
				"title": {
					"type": "text",
					"fields": {
						"english": {
							"type": "text",
							"analyzer": "english"
						}
					}
				},
				"content": {
					"type": "text",
					"fields": {
						"english": {
							"type": "text",
							"analyzer": "english"
						}
					}
				}
			}
		}
	}
}

[update 2]

Scala code to implement search:

def searchByField(indices: Seq[ESIndexName], terms: Seq[(String, String)], size: Int = 20): SearchResponse = {

    val searchRequest = new SearchRequest
    searchRequest.indices(indices.map(idx => idx.completeIndexName()): _*)
    searchRequest.source(buildTargetFieldsMatchQuery(terms, size))

    searchRequest.indicesOptions(IndicesOptions.strictSingleIndexNoExpandForbidClosed())

    client.search(searchRequest, RequestOptions.DEFAULT)
  }

and query is built as follows:

private def buildTargetFieldsMatchQuery(termsByField: Seq[(String, String)], size: Int): SearchSourceBuilder = {

    val query = new BoolQueryBuilder

    termsByField.foreach {
      case (field, term) =>

        if (field == "content") {
          logger.debug(field + " should have " + term)
          query.should(new MatchQueryBuilder(field+standardAnalyzer, term.toLowerCase))
          query.should(new MatchQueryBuilder(field, term.toLowerCase))
        }
        else if (field == "title"){
          logger.debug(field + " should have " + term)
          // Note: boost without an argument only reads the current boost; pass a value to set one.
          query.should(new MatchQueryBuilder(field+standardAnalyzer, term.toLowerCase())).boost
        }
        else {
          logger.debug(field + " should have " + term)
          query.should(new MatchQueryBuilder(field, term.toLowerCase))
        }

    }
    val sourceBuilder: SearchSourceBuilder = new SearchSourceBuilder()
    sourceBuilder.query(query)
    sourceBuilder.from(0)
    sourceBuilder.size(size)
    sourceBuilder.timeout(new TimeValue(60, TimeUnit.SECONDS))
    sourceBuilder.highlighter(buildHighlighter())

  }

Upvotes: 1

Views: 698

Answers (1)

xeraa

Reputation: 10859

With plain REST the following is working fine for me:

PUT test
{
  "settings": {
    "number_of_shards": 1,
    "analysis": {
      "filter": {
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        },
        "english_keywords": {
          "type": "keyword_marker",
          "keywords": []
        },
        "english_stemmer": {
          "type": "stemmer",
          "language": "english"
        },
        "english_possessive_stemmer": {
          "type": "stemmer",
          "language": "possessive_english"
        }
      },
      "analyzer": {
        "english": {
          "tokenizer": "standard",
          "filter": [
            "english_possessive_stemmer",
            "lowercase",
            "english_stop",
            "english_keywords",
            "english_stemmer"
          ]
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "content": {
          "type": "text",
          "fields": {
            "english": {
              "type": "text",
              "analyzer": "english"
            }
          }
        }
      }
    }
  }
}

POST test/_doc/
{
  "content": "This is my responsibility"
}

POST test/_doc/
{
  "content": "These are my responsibilities"
}

GET test/_search
{
  "query": {
    "match": {
      "content.english": "responsibilities"
    }
  },
  "highlight": {
    "fields": {
      "content.english": {
        "type": "unified"
      }
    }
  }
}

The result is then:

"hits" : [
  {
    "_index" : "test",
    "_type" : "_doc",
    "_id" : "5D5PPGoBqgTTLzdtM-_Y",
    "_score" : 0.18232156,
    "_source" : {
      "content" : "This is my responsibility"
    },
    "highlight" : {
      "content.english" : [
        "This is my <em>responsibility</em>"
      ]
    }
  },
  {
    "_index" : "test",
    "_type" : "_doc",
    "_id" : "5T5PPGoBqgTTLzdtZe8U",
    "_score" : 0.18232156,
    "_source" : {
      "content" : "These are my responsibilities"
    },
    "highlight" : {
      "content.english" : [
        "These are my <em>responsibilities</em>"
      ]
    }
  }
]

Looking at your Java / Groovy (?) code, it looks close enough to the example in the docs. Could you log the actual query you are running, so we can figure out what is going wrong? Generally it should work like this.
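
For comparison, here is a minimal Scala sketch of the same REST request built with the high-level client classes. It assumes the query and the highlighter both target the content.english sub-field from the mapping above. The object and method names (HighlightSketch, buildEnglishHighlighter, buildSource) are made up for illustration and are not taken from your code. Printing the SearchSourceBuilder is one way to log the request body that would actually be sent.

```scala
import org.elasticsearch.index.query.QueryBuilders
import org.elasticsearch.search.builder.SearchSourceBuilder
import org.elasticsearch.search.fetch.subphase.highlight.HighlightBuilder

// Hypothetical names; a sketch, not a drop-in replacement for the original code.
object HighlightSketch {

  // Highlight the stemmed sub-field, mirroring the "highlight" section of the REST request.
  def buildEnglishHighlighter(): HighlightBuilder = {
    val field = new HighlightBuilder.Field("content.english")
    field.highlighterType("unified")
    new HighlightBuilder().field(field)
  }

  // Query the same sub-field that is being highlighted.
  def buildSource(term: String): SearchSourceBuilder =
    new SearchSourceBuilder()
      .query(QueryBuilders.matchQuery("content.english", term))
      .highlighter(buildEnglishHighlighter())

  def main(args: Array[String]): Unit = {
    // toString on the source builder renders the request body as JSON,
    // which can be compared against the REST request shown earlier.
    println(buildSource("responsibilities"))
  }
}
```

If the printed JSON shows the match query and the highlight section pointing at different fields (for example content in one and content.english in the other), that mismatch would be the first thing I would check.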

Upvotes: 2
