Reputation: 2790
I want to index a document that contains Chinese characters/words. Some fields also contain HTML tags.
I used "html_strip" to prevent the HTML from being indexed, but my problem is that the document is stored with the HTML in Elasticsearch. These are my index settings and mappings:
PUT test
{
  "settings": {
    "index": {
      "number_of_shards": 1,
      "number_of_replicas": 0
    },
    "analysis": {
      "analyzer": {
        "ch_analyzer": {
          "tokenizer": "icu_tokenizer",
          "char_filter": [ "html_strip" ]
        }
      }
    }
  },
  "mappings": {
    "qa": {
      "properties": {
        "comment_desc": {
          "type": "text",
          "analyzer": "ch_analyzer"
        },
        "article_title": {
          "type": "text",
          "analyzer": "ch_analyzer"
        },
        "article_desc": {
          "type": "text",
          "analyzer": "ch_analyzer"
        }
      }
    },
    "sport": {
      "properties": {
        "title": {
          "type": "text",
          "analyzer": "ch_analyzer"
        },
        "content": {
          "type": "text",
          "analyzer": "ch_analyzer"
        }
      }
    }
  }
}
For example, I have the following content:
"<p><br/>台灣人,奧運直播,使用PPStream,(PPS網路電視),觀看同步奧運實況</b>!"
The HTML is indeed stripped at analysis time, but the document is stored (in `_source`) exactly as sent.
What should I change in my mappings so that the stored document is saved without its HTML component?
Upvotes: 1
Views: 246
Reputation: 1494
If you want to do this within Elasticsearch (rather than as a preprocessing step), you have to use an ingest node. There is no ingest processor that does exactly what you want, so you would have to use a script processor or write a plugin.
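As a rough sketch of the script-processor route: an ingest pipeline with a Painless script can regex-replace tags out of a field before the document is stored. Note that the field name below is just an example from your mapping, and regex support in Painless must be enabled with `script.painless.regex.enabled: true` in `elasticsearch.yml`:

PUT _ingest/pipeline/strip_html
{
  "description": "Sketch: remove HTML tags from comment_desc before indexing",
  "processors": [
    {
      "script": {
        "lang": "painless",
        "source": "ctx.comment_desc = /<[^>]*>/.matcher(ctx.comment_desc).replaceAll('')"
      }
    }
  ]
}

You would then index documents with `?pipeline=strip_html` on the index request so the pipeline runs before the document is stored.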
Depending on your use case, it may be easier to do this in a pre-processing step (with code in your language of choice).
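For instance, if your indexing code happens to be in Python, a minimal preprocessing sketch using only the standard library's `html.parser` could strip the tags before you send the document to Elasticsearch (the `strip_html` helper name is just illustrative):

```python
from html.parser import HTMLParser


class HTMLStripper(HTMLParser):
    """Collects only text content, discarding all tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for text between tags; tags themselves are skipped.
        self.chunks.append(data)

    def get_text(self):
        return "".join(self.chunks)


def strip_html(raw):
    """Return `raw` with HTML tags removed."""
    stripper = HTMLStripper()
    stripper.feed(raw)
    stripper.close()
    return stripper.get_text()


print(strip_html("<p><br/>台灣人,奧運直播,使用PPStream,(PPS網路電視),觀看同步奧運實況</b>!"))
# 台灣人,奧運直播,使用PPStream,(PPS網路電視),觀看同步奧運實況!
```

You would run each HTML-bearing field through such a function before indexing, so the stored `_source` never contains markup in the first place.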
Upvotes: 1