Reputation: 67
I am trying to index English-language HTML documents with Elasticsearch. The data arrives as raw HTML. I found the html_strip character filter for removing HTML tags, but I cannot get it to work together with the english analyzer.
I expect the request below to return three tokens, but it returns five because "html" comes back as a token twice.
POST _analyze
{
  "analyzer": "english",
  "char_filter": ["html_strip"],
  "text": "<html>It will be raining in yosemite this weekend</html>"
}
How can I get only three tokens (no HTML tags) for the text above, so that the response looks like the following?
{
  "tokens": [
    {
      "token": "rain",
      "start_offset": 11,
      "end_offset": 18,
      "type": "<ALPHANUM>",
      "position": 3
    },
    {
      "token": "yosemit",
      "start_offset": 22,
      "end_offset": 30,
      "type": "<ALPHANUM>",
      "position": 5
    },
    {
      "token": "weekend",
      "start_offset": 36,
      "end_offset": 43,
      "type": "<ALPHANUM>",
      "position": 7
    }
  ]
}
Upvotes: 1
Views: 766
Reputation: 1337
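Define a custom analyzer that uses the components of the english analyzer as a template and adds the html_strip character filter to it. As far as I can tell, the _analyze API ignores char_filter when a named analyzer is given, which is why "html" still shows up as a token in your request. The english_keywords filter is optional; "example" is only a placeholder for words you do not want stemmed.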
PUT /english_with_html_strip
{
  "settings": {
    "analysis": {
      "filter": {
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        },
        "english_keywords": {
          "type": "keyword_marker",
          "keywords": ["example"]
        },
        "english_stemmer": {
          "type": "stemmer",
          "language": "english"
        },
        "english_possessive_stemmer": {
          "type": "stemmer",
          "language": "possessive_english"
        }
      },
      "analyzer": {
        "english_with_html_strip": {
          "tokenizer": "standard",
          "char_filter": ["html_strip"],
          "filter": [
            "english_possessive_stemmer",
            "lowercase",
            "english_stop",
            "english_keywords",
            "english_stemmer"
          ]
        }
      }
    }
  }
}
Then you can run:
POST /english_with_html_strip/_analyze
{
  "analyzer": "english_with_html_strip",
  "text": "<html>It will be raining in yosemite this weekend</html>"
}
This assumes you want the text analyzed the way the english analyzer does it (lowercasing, stop words, stemming). If you only want it tokenized with the HTML stripped, you can run:
POST _analyze
{
  "tokenizer": "standard",
  "char_filter": ["html_strip"],
  "text": "<html>It will be raining in yosemite this weekend</html>"
}
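Since the goal is to index documents rather than just call _analyze, here is a rough sketch of how the custom analyzer might be attached to a field. The field name body is made up for illustration, and the typeless mapping syntax assumes Elasticsearch 7 or later; adjust both to your setup.

```
PUT /english_with_html_strip/_mapping
{
  "properties": {
    "body": {
      "type": "text",
      "analyzer": "english_with_html_strip"
    }
  }
}

POST /english_with_html_strip/_doc
{
  "body": "<html>It will be raining in yosemite this weekend</html>"
}
```

Keep in mind that the analyzer only affects the indexed tokens. The stored _source still contains the raw HTML, so the markup comes back unchanged when you fetch the document.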
Upvotes: 2