Jared Dunham

Reputation: 1527

Cross Field Search with Multiple Complete and Incomplete Phrases in Each Field

Example data:

PUT /test/test/1
{
    "text1":"cats meow",
    "text2":"12345",
    "text3":"toy"
}

PUT /test/test/2
{
    "text1":"dog bark",
    "text2":"98765",
    "text3":"toy"
}

And an example query:

GET /test/test/_search
{
    "size": 25,
    "query": {
        "multi_match" : {
            "fields" : [
                "text1", 
                "text2",
                "text3"
            ],
            "query" : "meow cats toy",
            "type" : "cross_fields"
        }
    }
}

Returns the cat hit first and then the dog, which is what I want.

But when I query cat toy, the cat and the dog get the same relevance score. I want the search to take the prefix of each word into account (and maybe a few other words inside that field) while still running cross_fields.

So if I search:

GET /test/test/_search
{
    "size": 25,
    "query": {
        "multi_match" : {
            "fields" : [
                "text1", 
                "text2",
                "text3"
            ],
            "query" : "cat toy",
            "type" : "phrase_prefix"
        }
    }
}

or

GET /test/test/_search
{
    "size": 25,
    "query": {
        "multi_match" : {
            "fields" : [
                "text1", 
                "text2",
                "text3"
            ],
            "query" : "meow cats",
            "type" : "phrase_prefix"
        }
    }
}

I expected to get the cat (ID 1) back, but I did not.

I found that cross_fields handles multi-word phrases but not multiple incomplete phrases, and phrase_prefix handles one incomplete phrase but not multiple incomplete phrases...

Sifting through the documentation really isn't helping me discover how to combine these two.

Upvotes: 2

Views: 1145

Answers (1)

Jared Dunham

Reputation: 1527

Yeah, I had to apply an analyzer...

The analyzer is applied to the fields when creating the index before you add any data. I couldn't find an easier way to do this after you add the data.

The solution I found is to explode every word into its individual prefixes so cross_fields can do its magic. You can learn more about the use of edge-ngram here.

So instead of cross_fields searching only for the term cats, it now searches for c, ca, cat, and cats, and likewise for every word after it. The text1 field will look like this to Elasticsearch: c ca cat cats m me meo meow.
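To make the expansion concrete, here is a minimal Python sketch of what an edge n-gram filter does to each word. It is only an illustration, not Elasticsearch's implementation: the function name edge_ngrams is made up for this example, and it splits on whitespace instead of using the standard tokenizer. The min_gram and max_gram defaults mirror the filter settings used further down.

```python
def edge_ngrams(text, min_gram=1, max_gram=20):
    """Return the edge n-grams (prefixes) of every word in text.

    Hypothetical helper for illustration only: it lowercases and splits on
    whitespace, which approximates the lowercase filter and standard
    tokenizer for simple input like "cats meow".
    """
    tokens = []
    for word in text.lower().split():
        # Emit prefixes of length min_gram up to max_gram (or the word length).
        upper = min(max_gram, len(word))
        for length in range(min_gram, upper + 1):
            tokens.append(word[:length])
    return tokens


print(" ".join(edge_ngrams("cats meow")))
# prints: c ca cat cats m me meo meow
```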

~~~

Here are the steps to make the above question example work:

First you create and name the analyzer. To learn a bit more about what the filter's values mean, I recommend you take a look at this.

PUT /test
{
    "settings": {
        "number_of_shards": 1, 
        "analysis": {
            "filter": {
                "autocomplete_filter": { 
                    "type":     "edge_ngram",
                    "min_gram": 1,
                    "max_gram": 20
                }
            },
            "analyzer": {
                "autocomplete": {
                    "type":      "custom",
                    "tokenizer": "standard",
                    "filter": [
                        "lowercase",
                        "autocomplete_filter" 
                    ]
                }
            }
        }
    }
}

Then I attached this analyzer to each field I wanted prefix matching on. The example below uses text1; change the name to match the field you are applying it to.

PUT /test/_mapping/test
{
    "test": {
        "properties": {
            "text1": {
                "type":     "string",
                "analyzer": "autocomplete"
            }
        }
    }
}

I ran GET /test/_mapping to be sure everything worked.

Then to add the data:

POST /test/test/_bulk
{ "index": { "_id": 1 }}
{ "text1": "cats meow", "text2": "12345", "text3": "toy" }
{ "index": { "_id": 2 }}
{ "text1": "dog bark", "text2": "98765", "text3": "toy" }

And the search!

GET /test/test/_search
{
    "size": 25,
    "query": {
        "multi_match" : {
            "fields" : [
                "text1", 
                "text2",
                "text3"
            ],
            "query" : "cat toy",
            "type" : "cross_fields"
        }
    }
}

Which returns:

{
   "took": 3,
   "timed_out": false,
   "_shards": {
      "total": 1,
      "successful": 1,
      "failed": 0
   },
   "hits": {
      "total": 2,
      "max_score": 0.70778143,
      "hits": [
         {
            "_index": "test",
            "_type": "test",
            "_id": "1",
            "_score": 0.70778143,
            "_source": {
               "text1": "cats meow",
               "text2": "12345",
               "text3": "toy"
            }
         },
         {
            "_index": "test",
            "_type": "test",
            "_id": "2",
            "_score": 0.1278426,
            "_source": {
               "text1": "dog bark",
               "text2": "98765",
               "text3": "toy"
            }
         }
      ]
   }
}

This creates contrast between the two when you search cat toy, whereas before the scores were the same. Now the cat hit has the higher score, as it should. This works because every prefix of each word (up to 20 characters in this case) is indexed, and cross_fields then judges how relevant the document is against those prefixes.
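As a rough illustration of why the scores now differ, the Python sketch below compares the prefix tokens of the query with those of each document's text1 value. It assumes the query is analyzed with the same autocomplete analyzer as the field (the mapping above sets no separate search analyzer), and it only shows which tokens overlap; it does not reproduce Elasticsearch's actual scoring. The helper names are invented for this example.

```python
def edge_ngrams(text, min_gram=1, max_gram=20):
    """Set of prefixes for every whitespace-separated word (illustrative only)."""
    return {
        word[:n]
        for word in text.lower().split()
        for n in range(min_gram, min(max_gram, len(word)) + 1)
    }


def shared_tokens(query, field_value):
    """Tokens produced by both the query and the indexed field value."""
    return sorted(edge_ngrams(query) & edge_ngrams(field_value))


print(shared_tokens("cat toy", "cats meow"))  # ['c', 'ca', 'cat']
print(shared_tokens("cat toy", "dog bark"))   # []
```

Document 1 shares three tokens with the query in text1 while document 2 shares none, so only the toy match in text3 counts for the dog.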

Upvotes: 1
