HotStuff68

Reputation: 965

How should I index this schema in Elasticsearch

I am a bit lost on how to index these documents in Elasticsearch.

Document 1

{
    text: ['chicken']
}

Document 2

{
    text: ['chicken', ['broth', 'stock']]
}

I need to be able to query these using either 'chicken flavored stock' or 'chicken flavored broth', and both documents should come back with the same score, since all of their terms are matched by the input query. Document 2 should not be returned when the query is just 'chicken'.

Basically, I want to know that all the terms in the 'text' field have been found somewhere in the query, with the inner array (i.e. 'broth' and 'stock') acting like an OR clause.
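To make the rule concrete, here is a minimal Python sketch of the matching I am after. It runs purely client-side and is not an Elasticsearch feature; the function name `matches` and the nested-list document shape are only assumptions for illustration.

```python
import re

def matches(doc_terms, query):
    """Return True when every entry of doc_terms is satisfied by the query.

    A plain string must appear among the query words. A nested list is an
    OR group: it is satisfied when at least one alternative appears.
    """
    words = set(re.findall(r"\w+", query.lower()))
    for entry in doc_terms:
        alternatives = entry if isinstance(entry, list) else [entry]
        if not any(alt in words for alt in alternatives):
            return False
    return True

doc1 = ["chicken"]
doc2 = ["chicken", ["broth", "stock"]]

for query in ("chicken flavored stock", "chicken flavored broth", "chicken"):
    print(query, "| doc1:", matches(doc1, query), "| doc2:", matches(doc2, query))
```

Extra words in the query (like 'flavored') are ignored; only the document's own terms decide whether it matches.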

Is this even possible?

Update:

I did find a (cumbersome) way of doing it. I save each document by combining its fields into phrases (e.g. ['chicken broth', 'chicken stock'] for doc 2). Then I search using every combination of the input words as a phrase (e.g. ['chicken', 'chicken flavored', 'chicken flavored broth', 'chicken broth', ...]).
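Roughly, the workaround looks like this in Python. This is a sketch of the idea only, not my exact code; the helper names `doc_phrases` and `query_phrases` are made up for this example.

```python
from itertools import combinations, product

def doc_phrases(doc_terms):
    """Expand nested OR groups into every phrase the document stands for."""
    groups = [e if isinstance(e, list) else [e] for e in doc_terms]
    return [" ".join(choice) for choice in product(*groups)]

def query_phrases(query):
    """Every non-empty, order-preserving subset of the query words."""
    words = query.split()
    return [
        " ".join(combo)
        for size in range(1, len(words) + 1)
        for combo in combinations(words, size)
    ]

print(doc_phrases(["chicken", ["broth", "stock"]]))
# ['chicken broth', 'chicken stock']
print(query_phrases("chicken flavored broth"))
```

A document is a hit when any of its stored phrases equals one of the generated query phrases. The number of query phrases is 2^n - 1 for n words, which is why this feels cumbersome.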

This solution does give me the results I want, but I can't help but feel this is a common case that could be handled much more elegantly. It feels like ngrams are somewhere along the path to my answer, but I can't quite work it out.

Upvotes: 0

Views: 159

Answers (2)

vaidik

Reputation: 2213

Here is something you can try. The percolator can solve your problem, but you will have to change the way you index your documents.

Instead of indexing doc1 the way you do now, index it as a percolator query:

PUT /test-index/.percolator/1
{
    "query": {
        "term": {
           "text": {
              "value": "chicken"
           }
        }
    }
}

And index doc2 like so:

PUT /test-index/.percolator/2
{
   "query": {
      "bool": {
         "must": [
            {
               "term": {
                  "text": {
                     "value": "chicken"
                  }
               }
            },
            {
               "bool": {
                  "should": [
                     {
                        "term": {
                           "text": {
                              "value": "broth"
                           }
                        }
                     },
                     {
                        "term": {
                           "text": {
                              "value": "stock"
                           }
                        }
                     }
                  ]
               }
            }
         ]
      }
   }
}
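If you have many documents, writing these bodies by hand gets tedious. Here is a rough Python sketch that generates the same query bodies from the nested-list form used in the question. The helper names `term` and `percolator_query` are hypothetical, and the code only builds the JSON; it does not send anything to Elasticsearch.

```python
import json

def term(value):
    """Build a single term clause on the 'text' field."""
    return {"term": {"text": {"value": value}}}

def percolator_query(doc_terms):
    """Turn ['chicken', ['broth', 'stock']] into a percolator query body.

    Plain strings become required term clauses; a nested list becomes a
    bool/should group, so any one of its alternatives is enough.
    """
    clauses = []
    for entry in doc_terms:
        if isinstance(entry, list):
            clauses.append({"bool": {"should": [term(v) for v in entry]}})
        else:
            clauses.append(term(entry))
    if len(clauses) == 1:
        # A single clause needs no bool wrapper (this is the doc1 case).
        return {"query": clauses[0]}
    return {"query": {"bool": {"must": clauses}}}

print(json.dumps(percolator_query(["chicken"]), indent=2))
print(json.dumps(percolator_query(["chicken", ["broth", "stock"]]), indent=2))
```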

Now, instead of querying your documents the way you were earlier, percolate the search text as a document:

GET /test-index/all_terms_search/_percolate
{
    "doc": {
        "text": "chicken flavored stock"
    }
}

This will return both of your documents. It also gives you the flexibility to control what and how much you want to match. When you index a document's reverse query in the percolator, you give that query an ID. Keyed by that ID, you can keep the original text in a simpler form, either in a separate Elasticsearch index or in some other datastore that can fetch the matching documents quickly.

Upvotes: 0

Dan Tuffery

Reputation: 5924

When you index documents without adding a custom mapping, Elasticsearch uses the standard analyzer by default.

You could remove the arrays from the text fields and index your documents as:

Document 1

{
   "text": "chicken"
}

Document 2

{
   "text": "chicken broth stock"
}

The standard analyzer will create the following tokens in the Lucene index:

Document 1

"chicken"

Document 2

"chicken", "broth", "stock"

Your documents are matching the search terms as follows:

chicken: the term 'chicken' matches in both documents. Because the text field is shorter in Document 1, it scores higher than Document 2.

chicken flavored: the term 'chicken' matches in both documents, but there is no match for the term 'flavored'. Again, as the text field is shorter in Document 1, it scores higher than Document 2.

chicken flavored broth: the term 'chicken' matches in both documents, and the term 'broth' also matches in Document 2. There is no match on the term 'flavored' in either document. Document 2 scores higher than Document 1 because it matches two of the terms in the query.
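The term overlap described above can be reproduced with a small Python sketch. The `tokenize` function is only a rough stand-in for the standard analyzer (lowercase, split on non-word characters), not its real implementation, and the code lists which query terms match rather than computing actual relevance scores.

```python
import re

def tokenize(text):
    """Rough stand-in for the standard analyzer: lowercase and split on non-word characters."""
    return re.findall(r"\w+", text.lower())

docs = {
    "Document 1": "chicken",
    "Document 2": "chicken broth stock",
}

def matched_terms(query, text):
    """Query terms that also occur among the document's tokens, in query order."""
    doc_tokens = set(tokenize(text))
    return [t for t in tokenize(query) if t in doc_tokens]

for query in ("chicken", "chicken flavored", "chicken flavored broth"):
    for name, text in docs.items():
        print(f"{query!r} -> {name}: {matched_terms(query, text)}")
```

The real score also depends on factors such as field length, which is why Document 1 ranks first when both documents match only 'chicken'.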

I don't really see a use case for ngrams as the above does what you want.

Upvotes: 1
