Umang Kamdar

Reputation: 53

Elasticsearch Query for Exact Substring Matching with Spaces

I'm trying to perform an exact substring match in Elasticsearch, including substrings that contain spaces. Here’s what I need:

  1. Search for an exact substring within a larger text field.
  2. The substring may contain spaces.
  3. The substring may be a partial word and not necessarily a full word.
  4. I want to match the substring exactly as it appears, not just individual terms.

I've tried the following approaches without success:

Wildcard Query:

{
  "query": {
    "wildcard": {
      "description": {
        "value": "*substring with spaces*",
        "case_insensitive": true
      }
    }
  }
}

Query String with Analyze Wildcard:

{
  "query": {
    "query_string": {
      "query": "*substring with spaces*",
      "fields": ["description"],
      "analyze_wildcard": true,
      "default_operator": "AND"
    }
  }
}

Neither approach returns the expected results when the substring contains spaces.

Example:

Field: description

Value: "quick brown fox"

The substring can be any of the following: "quick brown fox", "quick bro", "quick", or "brown f", and in each case I want it to match exactly.

How can I construct an Elasticsearch query to achieve exact substring matching, including spaces and partial words?

Upvotes: 0

Views: 104

Answers (1)

imotov

Reputation: 30163

If a partial word always starts at the beginning of a word, you can use match_phrase_prefix (or match_phrase when the substring consists of whole words only):

DELETE test
PUT test
{
  "mappings": {
    "properties": {
      "description": {
        "type": "text"
      }
    }
  }
}

POST test/_bulk?refresh=true
{ "index": { "_id": "1" } }
{ "description": "The quick brown fox"}
{ "index": { "_id": "2" } }
{ "description": "The slow red fox with brown spots wasn't that quick"}

POST test/_search
{
  "query": {
    "match_phrase": {
      "description": "quick brown fox"
    }
  }
}

POST test/_search
{
  "query": {
    "match_phrase_prefix": {
      "description": "quick bro"
    }
  }
}

POST test/_search
{
  "query": {
    "match_phrase_prefix": {
      "description": "quick"
    }
  }
}

POST test/_search
{
  "query": {
    "match_phrase_prefix": {
      "description": "brown f"
    }
  }
}
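
Note that match_phrase_prefix treats only the last term as a prefix; every earlier term has to match a whole token. As a sketch against the same test index, a substring that starts in the middle of a word should return no hits, because the analyzer never produced a token "ick":

POST test/_search
{
  "query": {
    "match_phrase_prefix": {
      "description": "ick bro"
    }
  }
}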

If the substring can start in the middle of a word and you can accept a significant performance penalty, map the field as keyword and use a wildcard query:

DELETE test
PUT test
{
  "mappings": {
    "properties": {
      "description": {
        "type": "keyword"
      }
    }
  }
}

POST test/_bulk?refresh=true
{ "index": { "_id": "1" } }
{ "description": "The quick brown fox"}
{ "index": { "_id": "2" } }
{ "description": " jumps over the lazy dog"}


POST test/_search
{
  "query": {
    "wildcard": {
      "description": {
        "value": "*ick bro*",
        "case_insensitive": true
      }
    }
  }
}

POST test/_search
{
  "query": {
    "wildcard": {
      "description": {
        "value": "*wn fox*",
        "case_insensitive": true
      }
    }
  }
}
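
One way to see why the keyword mapping matters here, and why a wildcard pattern with spaces fails on an analyzed text field, is to run the text through the _analyze API. This is a sketch that assumes the text field used the default standard analyzer:

POST _analyze
{
  "analyzer": "standard",
  "text": "The quick brown fox"
}

This returns four separate lowercase tokens (the, quick, brown, fox). A wildcard query is matched against individual indexed terms, so a pattern containing a space cannot match any of them. With the keyword mapping, the whole string is indexed as a single term, so the pattern can span words.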

If you have millions of distinct descriptions, if descriptions exceed the keyword limit of 32K, or if this is the primary way of searching the field, you can switch from the keyword type to the wildcard type. This relaxes the keyword limitations but comes with increased storage and indexing costs.
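
As a minimal sketch of that switch, only the mapping changes. This assumes an Elasticsearch version that includes the wildcard field type (added in 7.9) and reuses the test index name from the examples above:

DELETE test
PUT test
{
  "mappings": {
    "properties": {
      "description": {
        "type": "wildcard"
      }
    }
  }
}

After reindexing the documents, the wildcard queries shown earlier can be sent unchanged.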

Upvotes: 0
