John

Reputation: 67

Searching English Text in HTML Using Elasticsearch

I am trying to index English-language HTML documents with Elasticsearch. The data arrives as raw HTML. I found the html_strip character filter for removing HTML tags, but I cannot get it to work together with the english analyzer.

I expect the request below to return three tokens, but it returns five because "html" is emitted as a token twice (once for the opening tag and once for the closing tag).

POST _analyze
{
  "analyzer": "english", 
  "char_filter": ["html_strip"], 
  "text": "<html>It will be raining in yosemite this weekend</html>"
}

How can I get only three tokens (no HTML tags) for the text above, so that the response looks like the following?

{
  "tokens": [
    {
      "token": "rain",
      "start_offset": 11,
      "end_offset": 18,
      "type": "<ALPHANUM>",
      "position": 3
    },
    {
      "token": "yosemit",
      "start_offset": 22,
      "end_offset": 30,
      "type": "<ALPHANUM>",
      "position": 5
    },
    {
      "token": "weekend",
      "start_offset": 36,
      "end_offset": 43,
      "type": "<ALPHANUM>",
      "position": 7
    }
  ]
}

Upvotes: 1

Views: 766

Answers (1)

sramalingam24

Reputation: 1337

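Define a custom analyzer that rebuilds the built-in english analyzer from its components and adds the html_strip character filter to it. As the output in the question shows, a char_filter passed alongside a named analyzer in the same _analyze request has no effect, so the character filter has to be part of the analyzer definition itself. The english_keywords filter lists words to exclude from stemming; "example" is only a placeholder, and the filter can be dropped if you have no such words.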

PUT /english_with_html_strip
{
  "settings": {
    "analysis": {
      "filter": {
        "english_stop": {
          "type":       "stop",
          "stopwords":  "_english_" 
        },
        "english_keywords": {
          "type":       "keyword_marker",
          "keywords":   ["example"] 
        },
        "english_stemmer": {
          "type":       "stemmer",
          "language":   "english"
        },
        "english_possessive_stemmer": {
          "type":       "stemmer",
          "language":   "possessive_english"
        }
      },
      "analyzer": {
        "english_with_html_strip": {
          "tokenizer":  "standard",
          "char_filter": ["html_strip"],
          "filter": [
            "english_possessive_stemmer",
            "lowercase",
            "english_stop",
            "english_keywords",
            "english_stemmer"
          ]
        }
      }
    }
  }
}

Then you can test the analyzer against that index:

POST /english_with_html_strip/_analyze
{
  "analyzer": "english_with_html_strip", 
  "text": "<html>It will be raining in yosemite this weekend</html>"
}

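The custom analyzer above assumes you want full English analysis (stop word removal and stemming). If you only want the HTML stripped and the text tokenized, you can pass the tokenizer and character filter directly, without naming an analyzer: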

POST _analyze
{
  "tokenizer": "standard",
  "char_filter": ["html_strip"],
  "text": "<html>It will be raining in yosemite this weekend</html>"
}
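
To apply the custom analyzer to real documents rather than only to _analyze requests, it has to be attached to a text field in the index mapping. A minimal sketch, assuming Elasticsearch 7 or later (typeless mappings) and a hypothetical field named content that holds the raw HTML:

```
# "content" is a hypothetical field name; replace it with your own.
# Assumes the english_with_html_strip index from above already exists.
PUT /english_with_html_strip/_mapping
{
  "properties": {
    "content": {
      "type": "text",
      "analyzer": "english_with_html_strip"
    }
  }
}
```

With a mapping like this, text indexed into content should have its tags stripped before tokenizing, and match queries on that field use the same analyzer by default.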

Upvotes: 2
