Elasticsearch combining language and char_filter in an analyzer

Question

I'm trying to combine a language analyzer with a char_filter, but when I look at the _termvectors for the field I can still see values that are attributes of custom XML tags, like "22anchor_titl", so the HTML/XML tags are apparently not being stripped.

My idea was to extend the german language analyzer:

settings: 
  analysis:
    analyzer:
      node_body_analyzer:
        type: 'german'
        char_filter: ['html_strip']

mappings:
  mappings:
    node:
      body:
        type: 'string'
        analyzer: 'node_body_analyzer'
        search_analyzer: 'node_search_analyzer'

Is there an error in my configuration, or is deriving a new analyzer from 'german' by adding a char_filter simply not possible? If so, would I have to define a type: 'custom' analyzer, re-implement the whole german analyzer as in the documentation, and add the char_filter?

Cheers

Andrei Stefan · Accepted Answer

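Yes, you need to do that. The built-in language analyzers cannot be extended with additional char filters or token filters. Consider what would happen if you wanted to add another token filter: where should ES place it in the list of already existing token filters, given that the order matters? You need to rebuild the german analyzer as a custom one, something like this: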

"analysis": {
  "filter": {
    "german_stop": {
      "type":       "stop",
      "stopwords":  "_german_" 
    },
    "german_keywords": {
      "type":       "keyword_marker",
      "keywords":   ["ghj"] 
    },
    "german_stemmer": {
      "type":       "stemmer",
      "language":   "light_german"
    }
  },
  "analyzer": {
    "my_analyzer": {
      "type":       "custom",
      "tokenizer":  "standard",
      "filter": [
        "lowercase",
        "german_stop",
        "german_keywords",
        "german_normalization",
        "german_stemmer"
      ],
      "char_filter": ["html_strip"]
    }
  }
}
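
To tie this back to the question's mapping, here is a minimal sketch in Python that assembles a complete index-creation body. The function name build_index_body is hypothetical, and the analyzer is renamed to node_body_analyzer to match the question. The sketch assumes the 'node' type and 'string' field type from the question (newer Elasticsearch versions use 'text' and no mapping types), and it leaves out search_analyzer because node_search_analyzer is not defined anywhere in the question.

```python
import json


def build_index_body(keywords=("ghj",)):
    """Hypothetical helper: returns an index-creation body as a dict.

    The analysis section mirrors the answer above; the mapping names
    (node, body, node_body_analyzer) are taken from the question.
    """
    analysis = {
        "filter": {
            "german_stop": {"type": "stop", "stopwords": "_german_"},
            "german_keywords": {
                "type": "keyword_marker",
                "keywords": list(keywords),
            },
            "german_stemmer": {"type": "stemmer", "language": "light_german"},
        },
        "analyzer": {
            "node_body_analyzer": {
                "type": "custom",
                # Char filters run on the raw text, before the tokenizer.
                "char_filter": ["html_strip"],
                "tokenizer": "standard",
                # Token filters run in exactly this order.
                "filter": [
                    "lowercase",
                    "german_stop",
                    "german_keywords",
                    "german_normalization",
                    "german_stemmer",
                ],
            }
        },
    }
    mappings = {
        "node": {
            "properties": {
                # 'string' as in the question; an assumption about the ES version.
                "body": {"type": "string", "analyzer": "node_body_analyzer"}
            }
        }
    }
    return {"settings": {"analysis": analysis}, "mappings": mappings}


if __name__ == "__main__":
    # Print the JSON that would be sent when creating the index.
    print(json.dumps(build_index_body(), indent=2))
```

Because the field has no separate search_analyzer here, the same analyzer is applied at index and search time. After reindexing, checking _termvectors again should show whether tag attributes such as "anchor_title" still end up as terms.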
