Reputation: 2366
I'm working on a problem that requires exact matching in Elasticsearch.
For example, a document containing 'Brighton Pier' should match a search for 'brighton pier', but not a search for 'brighton' or 'pier' alone.
I've worked out how to do this simply, by setting the fields I want to search to not_analyzed.
However, without analysis, stopwords, casing, etc. affect the results.
So is there a way to skip analysis but still clean the text? Of course, you can clean the values before adding them to the index, and clean the search term as well, but this is tedious!
Upvotes: 0
Views: 38
Reputation: 8718
I think you can get what you want with the keyword tokenizer and the lowercase filter.
I'll give you a simple example. I set up an index like this, with a custom analyzer:
PUT /test_index
{
   "settings": {
      "number_of_shards": 1,
      "analysis": {
         "analyzer": {
            "my_analyzer": {
               "type": "custom",
               "tokenizer": "keyword",
               "filter": ["lowercase"]
            }
         }
      }
   },
   "mappings": {
      "doc": {
         "properties": {
            "text_field": {
               "type": "string",
               "analyzer": "my_analyzer"
            }
         }
      }
   }
}
Then I added a couple of documents:
POST /test_index/doc/_bulk
{"index":{"_id":1}}
{"text_field":"Brighton Pier"}
{"index":{"_id":2}}
{"text_field":"West Pier"}
It's helpful to take a look at the terms that are generated by the analyzer:
POST /test_index/_search?search_type=count
{
   "aggs": {
      "text_field_terms": {
         "terms": {
            "field": "text_field"
         }
      }
   }
}
...
{
   "took": 2,
   "timed_out": false,
   "_shards": {
      "total": 1,
      "successful": 1,
      "failed": 0
   },
   "hits": {
      "total": 2,
      "max_score": 0,
      "hits": []
   },
   "aggregations": {
      "text_field_terms": {
         "doc_count_error_upper_bound": 0,
         "sum_other_doc_count": 0,
         "buckets": [
            {
               "key": "brighton pier",
               "doc_count": 1
            },
            {
               "key": "west pier",
               "doc_count": 1
            }
         ]
      }
   }
}
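To see why the buckets above contain whole lowercased phrases rather than single words, here is a minimal Python sketch that mimics what the analyzer does. It is not Elasticsearch code: `keyword_lowercase_analyze` and `whitespace_lowercase_analyze` are hypothetical helpers, and the second is only a rough stand-in for a word-splitting analyzer (it ignores punctuation and stopwords), included for comparison.

```python
def keyword_lowercase_analyze(text):
    """Mimic my_analyzer: the keyword tokenizer keeps the whole input
    as one token, and the lowercase filter lowercases it."""
    return [text.lower()]


def whitespace_lowercase_analyze(text):
    """Rough stand-in for a word-splitting analyzer (comparison only)."""
    return [token.lower() for token in text.split()]


docs = ["Brighton Pier", "West Pier"]

# One term per document, so a search for "pier" alone has nothing to match.
keyword_terms = [keyword_lowercase_analyze(d) for d in docs]

# Word-level terms: here "pier" would match both documents.
word_terms = [whitespace_lowercase_analyze(d) for d in docs]

print(keyword_terms)  # [['brighton pier'], ['west pier']]
print(word_terms)     # [['brighton', 'pier'], ['west', 'pier']]
```

The first list corresponds to the two bucket keys returned by the terms aggregation.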
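Because I didn't specify separate index and search analyzers, the custom analyzer is used for both. A match query analyzes its search text, so it finds the document whatever the casing. Both of the following queries return the document; note that the second is a term query, which is not analyzed and therefore has to use the exact lowercase term: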
POST /test_index/_search
{
   "query": {
      "match": {
         "text_field": "Brighton Pier"
      }
   }
}
...
{
   "took": 1,
   "timed_out": false,
   "_shards": {
      "total": 1,
      "successful": 1,
      "failed": 0
   },
   "hits": {
      "total": 1,
      "max_score": 1,
      "hits": [
         {
            "_index": "test_index",
            "_type": "doc",
            "_id": "1",
            "_score": 1,
            "_source": {
               "text_field": "Brighton Pier"
            }
         }
      ]
   }
}
POST /test_index/_search
{
   "query": {
      "term": {
         "text_field": "brighton pier"
      }
   }
}
...
{
   "took": 1,
   "timed_out": false,
   "_shards": {
      "total": 1,
      "successful": 1,
      "failed": 0
   },
   "hits": {
      "total": 1,
      "max_score": 1,
      "hits": [
         {
            "_index": "test_index",
            "_type": "doc",
            "_id": "1",
            "_score": 1,
            "_source": {
               "text_field": "Brighton Pier"
            }
         }
      ]
   }
}
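However, a term query (or filter) does not analyze its input, so only the lowercase version will return a result; a term query for 'Brighton Pier' finds nothing, because the indexed term is 'brighton pier'.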
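The difference between the two query types can be sketched in Python. This is a toy model under stated assumptions, not how Elasticsearch is implemented: `analyze`, `build_index`, `match_query`, and `term_query` are hypothetical names, and the index is just a dictionary from term to document IDs.

```python
from collections import defaultdict


def analyze(text):
    """Mimic keyword tokenizer + lowercase filter: one lowercased token."""
    return [text.lower()]


def build_index(docs):
    """Map each analyzed term to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index


def match_query(index, text):
    """Analyze the query text first, then look up the resulting terms."""
    hits = set()
    for term in analyze(text):
        hits |= index.get(term, set())
    return hits


def term_query(index, term):
    """Look up the term exactly as given, with no analysis."""
    return set(index.get(term, set()))


index = build_index({1: "Brighton Pier", 2: "West Pier"})

print(match_query(index, "Brighton Pier"))  # {1}
print(term_query(index, "brighton pier"))   # {1}
print(term_query(index, "Brighton Pier"))   # set()
print(match_query(index, "pier"))           # set()
```

The last line shows the behaviour asked for in the question: a partial value such as 'pier' matches nothing, because each field value is indexed as a single term.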
Here is some code I used to play around with it:
http://sense.qbox.io/gist/d13a463af383c6fc5ad00d86bc27947c0016cf8f
Upvotes: 1