Reputation: 2790
I want to index a document that contains Chinese characters/words. Some fields also contain HTML tags.
I used "html_strip" to prevent the HTML from being indexed, but my problem is that the document is stored with the HTML in Elasticsearch. These are my index settings and mappings:
PUT test
{
  "settings": {
    "index": {
      "number_of_shards": 1,
      "number_of_replicas": 0
    },
    "analysis": {
      "analyzer": {
        "ch_analyzer": {
          "tokenizer": "icu_tokenizer",
          "char_filter": [ "html_strip" ]
        }
      }
    }
  },
  "mappings": {
    "qa": {
      "properties": {
        "comment_desc": {
          "type": "text",
          "analyzer": "ch_analyzer"
        },
        "article_title": {
          "type": "text",
          "analyzer": "ch_analyzer"
        },
        "article_desc": {
          "type": "text",
          "analyzer": "ch_analyzer"
        }
      }
    },
    "sport": {
      "properties": {
        "title": {
          "type": "text",
          "analyzer": "ch_analyzer"
        },
        "content": {
          "type": "text",
          "analyzer": "ch_analyzer"
        }
      }
    }
  }
}
For example, I have the following content:
"<p><br/>台灣人,奧運直播,使用PPStream,(PPS網路電視),觀看同步奧運實況</b>!"
The HTML is indeed stripped at analysis time, but the document is stored (in `_source`) exactly as sent.
What should I change in my mappings so that the stored document is saved without its HTML component?
Upvotes: 1
Views: 246
Reputation: 1494
If you want to do this within Elasticsearch (rather than as a preprocessing step), you have to use an ingest node. There is no ingest processor that does exactly what you want, so you would have to use a script processor or write a plugin.
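As a rough sketch of the script-processor route: an ingest pipeline with a Painless script can regex-replace tags out of a field before the document is stored. Note that the field name below is just an example from your mapping, and regex support in Painless must be enabled with `script.painless.regex.enabled: true` in `elasticsearch.yml`:

PUT _ingest/pipeline/strip_html
{
  "description": "Sketch: remove HTML tags from comment_desc before indexing",
  "processors": [
    {
      "script": {
        "lang": "painless",
        "source": "ctx.comment_desc = /<[^>]*>/.matcher(ctx.comment_desc).replaceAll('')"
      }
    }
  ]
}

You would then index documents with `?pipeline=strip_html` on the index request so the pipeline runs before the document is stored.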
Depending on your use case, it may be easier to do this in a pre-processing step (with code in your language of choice).
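For instance, if your indexing code happens to be in Python, a minimal preprocessing sketch using only the standard library's `html.parser` could strip the tags before you send the document to Elasticsearch (the `strip_html` helper name is just illustrative):

```python
from html.parser import HTMLParser


class HTMLStripper(HTMLParser):
    """Collects only text content, discarding all tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for text between tags; tags themselves are skipped.
        self.chunks.append(data)

    def get_text(self):
        return "".join(self.chunks)


def strip_html(raw):
    """Return `raw` with HTML tags removed."""
    stripper = HTMLStripper()
    stripper.feed(raw)
    stripper.close()
    return stripper.get_text()


print(strip_html("<p><br/>台灣人,奧運直播,使用PPStream,(PPS網路電視),觀看同步奧運實況</b>!"))
# 台灣人,奧運直播,使用PPStream,(PPS網路電視),觀看同步奧運實況!
```

You would run each HTML-bearing field through such a function before indexing, so the stored `_source` never contains markup in the first place.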
Upvotes: 1