Reputation: 2855
We have an Elasticsearch 5.5 setup. We use NEST to perform our queries from C#.
When executing the following query:
{ "query": { "bool": { "must": [ { "query_string": { "query": "00917751" } } ] } } }
We get the desired result: a single hit with that number as its identifier.
When executing the query:
{ "query": { "bool": { "must": [ { "query_string": { "query": "917751" } } ] } } }
We get no results.
The value we are searching for is in the field searchIndentifier, and has the value "1-00917751".
We have a custom analyzer called "final":
.Custom("final", cu => cu.Tokenizer("keyword").Filters(new List<string> { "lowercase" }))
The field searchIndentifier has no custom analyzer set on it. I tried adding the whitespace tokenizer to it, but that made no difference.
Another field, "searchObjectNo", does work: searching for "S328" finds the value "S328-25". The two fields are configured exactly the same.
Any ideas here?
Another related question: When executing the query
{ "query": { "bool": { "must": [ { "query_string": { "query": "1-00917751" } } ] } } }
we get a lot of results. I would like this to return only 1 result. How would we accomplish this?
Thank you Schoof
Settings and mapping: https://jsonblob.com/9dbf33f6-cd3e-11e8-8f17-c9de91b6f9d1
Upvotes: 0
Views: 2574
Reputation: 125488
The searchIndentifier field is mapped as a text datatype, which will undergo analysis and use the Standard Analyzer by default. Using the Analyze API, you can see what terms will be stored in the inverted index for 1-00917751:
var client = new ElasticClient();
var analyzeResponse = client.Analyze(a => a
    .Text("1-00917751")
);
which returns
{
  "tokens" : [
    {
      "token" : "1",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "<NUM>",
      "position" : 0
    },
    {
      "token" : "00917751",
      "start_offset" : 2,
      "end_offset" : 10,
      "type" : "<NUM>",
      "position" : 1
    }
  ]
}
You'll get a match for the query_string query with a query input of 00917751, as this matches one of the terms stored in the inverted index as a result of analysis at index time for the input 1-00917751.
You won't get a match for 917751, as there is no term in the inverted index that will match. You could define an analysis chain that removes leading zeroes from numeric tokens, and keep the original value available through a keyword sub-field, e.g.
private static void Main()
{
    var defaultIndex = "foobarbaz";
    var pool = new SingleNodeConnectionPool(new Uri("http://localhost:9200"));
    var settings = new ConnectionSettings(pool)
        .DefaultIndex(defaultIndex);
    var client = new ElasticClient(settings);
    client.CreateIndex(defaultIndex, c => c
        .Settings(s => s
            .Analysis(a => a
                .Analyzers(an => an
                    .Custom("trim_leading_zero", ca => ca
                        .Tokenizer("standard")
                        .Filters(
                            "standard",
                            "lowercase",
                            "trim_leading_zero",
                            "trim_zero_length")
                    )
                )
                .TokenFilters(tf => tf
                    .PatternReplace("trim_leading_zero", pr => pr
                        .Pattern("^0+(.*)")
                        .Replacement("$1")
                    )
                    .Length("trim_zero_length", t => t
                        .Min(1)
                    )
                )
            )
        )
        .Mappings(m => m
            .Map<MyDocument>(mm => mm
                .AutoMap()
                .Properties(p => p
                    .Text(t => t
                        .Name(n => n.SearchIndentifier)
                        .Analyzer("trim_leading_zero")
                        .Fields(f => f
                            .Keyword(k => k
                                .Name("keyword")
                                .IgnoreAbove(256)
                            )
                        )
                    )
                )
            )
        )
    );
    client.Index(new MyDocument { SearchIndentifier = "1-00917751" }, i => i
        .Refresh(Refresh.WaitFor)
    );
    client.Search<MyDocument>(s => s
        .Query(q => q
            .QueryString(qs => qs
                .Query("917751")
            )
        )
    );
}
public class MyDocument
{
    public string SearchIndentifier { get; set; }
}
The pattern_replace token filter will trim leading zeroes from tokens, and the length filter then drops any token left empty by the replacement.
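To see what the replacement does to individual tokens without a running cluster, here is a minimal sketch that applies the same pattern and replacement with .NET's Regex class. It assumes .NET and the Java regex engine used by Elasticsearch behave the same for this simple pattern; LeadingZeroDemo and TrimLeadingZeros are hypothetical names used only for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LeadingZeroDemo
{
    // Same pattern and replacement as the trim_leading_zero token filter.
    private static readonly Regex LeadingZeros = new Regex("^0+(.*)");

    // Hypothetical helper: applies the replacement to a single token.
    public static string TrimLeadingZeros(string token) =>
        LeadingZeros.Replace(token, "$1");

    public static void Main()
    {
        // "1" and "00917751" are the tokens the standard tokenizer emits
        // for "1-00917751"; "000" shows a token that becomes empty.
        foreach (var token in new[] { "1", "00917751", "000" })
        {
            Console.WriteLine($"\"{token}\" -> \"{TrimLeadingZeros(token)}\"");
        }
    }
}
```

A token made up only of zeroes becomes an empty string, which is why the analyzer also includes the length filter with a minimum of 1.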
The search query returns the indexed document:
{
  "took" : 69,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.33310556,
    "hits" : [
      {
        "_index" : "foobarbaz",
        "_type" : "mydocument",
        "_id" : "MVF4bmYBJZHQAT-BUx1K",
        "_score" : 0.33310556,
        "_source" : {
          "searchIndentifier" : "1-00917751"
        }
      }
    ]
  }
}
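On the related question of 1-00917751 returning many results: the query_string input is analyzed too, and its terms are combined with OR by default, so any document containing the term 1 can match. One possible approach, sketched here under the assumption that the index uses the mapping above (with its keyword sub-field), is an exact term query against searchIndentifier.keyword. The class name ExactMatchExample and the localhost URL are illustrative only.

```csharp
using System;
using Nest;

public class MyDocument
{
    public string SearchIndentifier { get; set; }
}

public static class ExactMatchExample
{
    public static void Main()
    {
        // Assumes a local node and the "foobarbaz" index created above.
        var settings = new ConnectionSettings(new Uri("http://localhost:9200"))
            .DefaultIndex("foobarbaz");
        var client = new ElasticClient(settings);

        // A term query is not analyzed, so it only matches documents whose
        // keyword sub-field holds exactly this value.
        var response = client.Search<MyDocument>(s => s
            .Query(q => q
                .Term(t => t
                    .Field(f => f.SearchIndentifier.Suffix("keyword"))
                    .Value("1-00917751")
                )
            )
        );

        Console.WriteLine(response.Total);
    }
}
```

Because the keyword sub-field stores the value unanalyzed, this match is case-sensitive and requires the full identifier, including the 1- prefix.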
Upvotes: 1