Reputation: 43
I need texts like "#tag1 quick brown fox #tag2" to be tokenized into #tag1, quick, brown, fox, #tag2, so that I can search this text on any of the patterns #tag1, quick, brown, fox, #tag2, where the symbol # must be included in the search term. In my index mapping I have a text type field (to search on quick, brown, fox) with a keyword type subfield (to search on #tag), and when I use the search term #tag it gives me a match only on the first token #tag1 but not on #tag2.
I think what I need is a tokenizer that will produce word-boundary tokens that include special characters. Can someone suggest a solution?
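To illustrate, the mapping looks roughly like this (index and field names are placeholders):
PUT [Your index name]
{
  "mappings": {
    "properties": {
      "[FieldName]": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword"
          }
        }
      }
    }
  }
}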
Upvotes: 1
Views: 3779
Reputation: 43
Thanks to @Kaveh's suggestion, I found my mistake. My custom analyzer (with lots of filters, etc.) was using the standard tokenizer, which I thought was similar to the whitespace tokenizer. Once I switched to the whitespace tokenizer in my custom analyzer, I can see that the analyzer no longer strips # from the beginning of words, and I can search on patterns starting with # using the simple_query_string query type.
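In case it helps others, here is a minimal sketch of the fix (index, field, and analyzer names are placeholders; my real analyzer has more filters):
PUT [Your index name]
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "[FieldName]": {
        "type": "text",
        "analyzer": "my_custom_analyzer"
      }
    }
  }
}
GET [Your index name]/_search
{
  "query": {
    "simple_query_string": {
      "query": "#tag1",
      "fields": ["[FieldName]"]
    }
  }
}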
Upvotes: 1
Reputation: 1310
If you want to include # in your search, you should use a different analyzer than the standard analyzer, because # is removed during the analysis phase. You can use the whitespace analyzer to analyze your text field.
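For example, a sketch of a mapping that applies the whitespace analyzer to the field (index and field names are placeholders):
PUT [Your index name]
{
  "mappings": {
    "properties": {
      "[FieldName]": {
        "type": "text",
        "analyzer": "whitespace"
      }
    }
  }
}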
Also, for searching you can use a wildcard query:
Query:
GET [Your index name]/_search
{
  "query": {
    "wildcard": {
      "[FieldName]": "#tag*"
    }
  }
}
You can find information about Elasticsearch's built-in analyzers here.
UPDATE:
Whitespace analyzer:
POST /_analyze
{
  "analyzer": "whitespace",
  "text": "#tag1 quick #tag2"
}
Result:
{
  "tokens" : [
    {
      "token" : "#tag1",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "quick",
      "start_offset" : 6,
      "end_offset" : 11,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "#tag2",
      "start_offset" : 12,
      "end_offset" : 17,
      "type" : "word",
      "position" : 2
    }
  ]
}
As you can see, #tag1 and #tag2 are two separate tokens, with the # preserved.
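For comparison, running the same text through the standard analyzer drops the #, which is why the default setup cannot match it:
POST /_analyze
{
  "analyzer": "standard",
  "text": "#tag1 quick #tag2"
}
This returns the tokens tag1, quick, and tag2, with no # left to search on.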
Upvotes: 1