Schoof

Reputation: 2855

Elasticsearch: query_string with a number does not always return the wanted result

We have an Elasticsearch 5.5 setup and use NEST to run our queries from C#.

When executing the following query:

{
  "query": {
    "bool": {
      "must": [
        {
          "query_string": {
            "query": "00917751"
          }
        }
      ]
    }
  }
}

We get the desired result: one result with that number as its identifier.

When executing the query:

{
  "query": {
    "bool": {
      "must": [
        {
          "query_string": {
            "query": "917751"
          }
        }
      ]
    }
  }
}

We get no results.

The value we are searching for is in the field searchIndentifier, and has the value "1-00917751".

We have a custom analyzer called "final":

.Custom("final", cu => cu.Tokenizer("keyword").Filters(new List<string> { "lowercase" }))

The field searchIndentifier has no custom analyzer set on it. I tried adding the whitespace tokenizer to it, but that made no difference.

Another field, "searchObjectNo", does work: searching for the value "S328-25" with the query "S328" returns it. The two fields are mapped exactly the same.

Any ideas here?

Another related question: When executing the query

{
  "query": {
    "bool": {
      "must": [
        {
          "query_string": {
            "query": "1-00917751"
          }
        }
      ]
    }
  }
}

we get a lot of results. I would like this to return only 1 result. How would we accomplish this?

Thank you Schoof

Settings and mapping: https://jsonblob.com/9dbf33f6-cd3e-11e8-8f17-c9de91b6f9d1

Upvotes: 0

Views: 2574

Answers (1)

Russ Cam

Reputation: 125488

The searchIndentifier field is mapped as a text datatype, so it undergoes analysis and uses the Standard Analyzer by default. Using the Analyze API, you can see which terms will be stored in the inverted index for 1-00917751:

var client = new ElasticClient();

var analyzeResponse = client.Analyze(a => a
    .Text("1-00917751")
);

which returns:

{
  "tokens" : [
    {
      "token" : "1",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "<NUM>",
      "position" : 0
    },
    {
      "token" : "00917751",
      "start_offset" : 2,
      "end_offset" : 10,
      "type" : "<NUM>",
      "position" : 1
    }
  ]
}

You get a match for the query_string query with the input 00917751 because it matches one of the terms stored in the inverted index when 1-00917751 was analyzed at index time.

You won't get a match for 917751 because no term in the inverted index matches it. You could define an analysis chain that removes leading zeroes from tokens, while keeping the original value available in a keyword sub-field, e.g.

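using System;
using Elasticsearch.Net; // SingleNodeConnectionPool, Refresh
using Nest;              // ElasticClient, ConnectionSettings

private static void Main()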
{
    var defaultIndex = "foobarbaz";
    var pool = new SingleNodeConnectionPool(new Uri("http://localhost:9200"));

    var settings = new ConnectionSettings(pool)
        .DefaultIndex(defaultIndex);

    var client = new ElasticClient(settings);

    client.CreateIndex(defaultIndex, c => c
        .Settings(s => s
            .Analysis(a => a
                .Analyzers(an => an
                    .Custom("trim_leading_zero", ca => ca
                        .Tokenizer("standard")
                        .Filters(
                            "standard", 
                            "lowercase", 
                            "trim_leading_zero",
                            "trim_zero_length")
                    )
                )
                .TokenFilters(tf => tf
                    .PatternReplace("trim_leading_zero", pr => pr
                        .Pattern("^0+(.*)")
                        .Replacement("$1")
                    )
                    .Length("trim_zero_length", t => t
                        .Min(1)
                    )
                )
            )
        )
        .Mappings(m => m
            .Map<MyDocument>(mm => mm
                .AutoMap()
                .Properties(p => p
                    .Text(t => t
                        .Name(n => n.SearchIndentifier)
                        .Analyzer("trim_leading_zero")
                        .Fields(f => f
                            .Keyword(k => k
                                .Name("keyword")
                                .IgnoreAbove(256)
                            )
                        )
                    )
                )
            )
        )
    );

    client.Index(new MyDocument { SearchIndentifier = "1-00917751" }, i => i
        .Refresh(Refresh.WaitFor)
    );

    client.Search<MyDocument>(s => s
        .Query(q => q
            .QueryString(qs => qs
                .Query("917751")
            )
        )
    );
}

public class MyDocument 
{
    public string SearchIndentifier { get; set; }
}

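The pattern_replace token filter (trim_leading_zero) trims leading zeroes from tokens, and the length filter (trim_zero_length) then drops any token that the replacement left empty, such as a token made up only of zeroes.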
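As a rough illustration of what these two filters do to individual tokens, the following standalone sketch applies the same pattern with .NET's Regex class instead of Elasticsearch. It is only an approximation: TrimLeadingZeroSketch and Filter are made-up names, and .NET regular expressions stand in here for the Java regex engine that the token filter actually uses.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of NEST or Elasticsearch.
public static class TrimLeadingZeroSketch
{
    // Same pattern as the "trim_leading_zero" pattern_replace filter.
    private static readonly Regex LeadingZeroes = new Regex("^0+(.*)");

    // Lowercases each token, strips leading zeroes, then drops empty tokens
    // the way a length filter with min = 1 would.
    public static List<string> Filter(IEnumerable<string> tokens)
    {
        return tokens
            .Select(t => LeadingZeroes.Replace(t.ToLowerInvariant(), "$1"))
            .Where(t => t.Length >= 1)
            .ToList();
    }

    public static void Main()
    {
        // Tokens the standard tokenizer produced for "1-00917751" in the Analyze output.
        var tokens = new[] { "1", "00917751" };
        Console.WriteLine(string.Join(", ", Filter(tokens)));
    }
}
```

Running it prints `1, 917751`, which is why a query for 917751 can now find a matching term.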

The search query in Main returns the indexed document:

{
  "took" : 69,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.33310556,
    "hits" : [
      {
        "_index" : "foobarbaz",
        "_type" : "mydocument",
        "_id" : "MVF4bmYBJZHQAT-BUx1K",
        "_score" : 0.33310556,
        "_source" : {
          "searchIndentifier" : "1-00917751"
        }
      }
    ]
  }
}
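On the related question: a query_string input of 1-00917751 is analyzed into separate terms (1 and the number), and with the default OR operator any document containing the term 1 matches, which would explain the large number of hits. One possible way to get only the exact identifier, assuming the keyword sub-field from the mapping above exists on your index, is a term query against that sub-field, since term queries are not analyzed. This is a sketch, not something verified against your data:

```json
{
  "query": {
    "term": {
      "searchIndentifier.keyword": "1-00917751"
    }
  }
}
```

This matches only documents whose whole searchIndentifier value is exactly that string, so partial inputs such as 917751 would still need the analyzed text field.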

Upvotes: 1
