notlkk

Reputation: 1231

Fuzzy Like This on Attachment Returns Nothing on Partial Word

I have my mapping like this:

    {
      "doc": {
        "mappings": {
          "mydocument": {
            "properties": {
              "file": {
                "type": "attachment",
                "path": "full",
                "fields": {
                  "file": {
                    "type": "string",
                    "store": true,
                    "term_vector": "with_positions_offsets"
                  },
                  "author": {
...

When I search for a complete word I get the result:

    {
      "query": {
        "fuzzy_like_this": {
          "fields": ["file"],
          "like_text": "This_is_something_I_want_to_search_for",
          "max_query_terms": 12
        }
      },
      "highlight": {
        "number_of_fragments": 3,
        "fragment_size": 650,
        "fields": {
          "file": {}
        }
      }
    }

But if I change the search term to "This_is_something_I_want" I get nothing. What am I missing?

Upvotes: 1

Views: 154

Answers (1)

Lourens

Reputation: 1518

To implement a partial match, you first need to understand what fuzzy_like_this does, and then decide what you want partial matching to return. fuzzy_like_this performs 2 key functions.

  1. The like_text is analyzed using the default analyzer. All the resulting tokens are then used to find documents based on term frequency, or tf-idf.

This typically means that the input is split on whitespace and lowercased. This_is_something_I_want is therefore tokenized to the single term this_is_something_i_want. Unless you have files containing this exact term, no documents will match.

  2. Secondly, all terms are fuzzified. Fuzzy searches score terms based on how many character changes need to be made to one word to match another. For instance, to get from bat to hat we need to make 1 character change.

In our case, to get from this_is_something_i_want to this_is_something_i_want_to_search_for, we need to make 14 character changes (adding _to_search_for). Standard fuzzy search only allows for 3 character changes when working with terms longer than 5 or 6 characters. Increasing the fuzzy limit to 14 would, however, produce severely skewed results.
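
To make the "character changes" count concrete, here is a minimal sketch of the classic Levenshtein edit distance in plain Python. It only illustrates the idea; it is not the code Elasticsearch or Lucene runs internally, and the function name edit_distance is just a name picked for this example.

```python
def edit_distance(a, b):
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, or substitutions needed to turn a into b."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


partial = "this_is_something_i_want"
full = "this_is_something_i_want_to_search_for"

print(edit_distance("bat", "hat"))   # 1
print(edit_distance(partial, full))  # 14
```

Because the partial term is a prefix of the full term, the distance is exactly the length of the missing suffix _to_search_for, which is far beyond what a fuzzy match on a single term will tolerate.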

So neither of these functions will help produce the results you seek.

Here is what I can suggest:

  1. You can implement an analyzer that splits on underscores. The tokens produced will then be ['this', 'is', 'something', 'i', 'want'], which can correctly be matched to the sample case.

  2. Alternatively, if all you want is a document that starts with the specified text, you can use a phrase prefix query (match_phrase_prefix) instead of fuzzy_like_this. See the Elasticsearch documentation for that query.
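
Both suggestions can be sketched roughly as follows. This is a hedged illustration, not a tested configuration for your index: the names underscore_tokenizer and underscore_analyzer are hypothetical, the regex only approximates locally what a pattern tokenizer plus lowercase filter would produce, and the analyzer would still need to be assigned to the file field in your mapping (followed by a reindex). The request bodies are built as Python dicts so they can be printed as JSON and sent with whatever client you use.

```python
import json
import re


def underscore_tokens(text):
    # Local stand-in for an analyzer that lowercases and splits on
    # underscores and other non-word characters (an approximation only).
    return [t for t in re.split(r"[\W_]+", text.lower()) if t]


partial_tokens = underscore_tokens("This_is_something_I_want")
full_tokens = underscore_tokens("This_is_something_I_want_to_search_for")

# Suggestion 1: hypothetical index settings defining a custom analyzer
# built on a pattern tokenizer. The analyzer/tokenizer names are made up.
index_settings = {
    "settings": {
        "analysis": {
            "tokenizer": {
                "underscore_tokenizer": {
                    "type": "pattern",
                    "pattern": r"[\W_]+",
                }
            },
            "analyzer": {
                "underscore_analyzer": {
                    "type": "custom",
                    "tokenizer": "underscore_tokenizer",
                    "filter": ["lowercase"],
                }
            },
        }
    }
}

# Suggestion 2: a phrase prefix query against the "file" field.
phrase_prefix_query = {
    "query": {
        "match_phrase_prefix": {
            "file": "This_is_something_I_want",
        }
    }
}

print(partial_tokens)
print(set(partial_tokens) <= set(full_tokens))  # every partial token occurs in the full text
print(json.dumps(index_settings, indent=2))
print(json.dumps(phrase_prefix_query, indent=2))
```

With the first approach, each of the five partial tokens appears among the tokens of the full text, so term-based queries have something to match. With the second, no mapping change is needed, but the match is limited to text that begins with the given prefix.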

Upvotes: 1
