orokusaki

Reputation: 57168

Elasticsearch - search for substring spanning 2 words

Simple Example

I have a document with a name text field that typically contains addresses:

1234 Palm Tree Street NE, Miami, FL 33101

I would expect Elasticsearch to find Palm Tree in the above address when I use a wildcard query with:

*alm Tre*

Instead, I get no results.

Rationale / realistic example

Sometimes the name field contains encoded information that spans 2 words, as follows:

R3358b7119 x3387HRL388

I'm using a wildcard query of the form *<search phrase>*, which works when the user enters either 2 whole "words" or a single partial word. But if the user enters the end of one word and the beginning of the next word, like b7119 x3387 (using the example above), the document isn't returned.
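For context, the query I'm sending looks roughly like this (a sketch of the request body; I'm actually calling it through the Python client, so the exact shape may differ slightly):

{
  "query": {
    "wildcard": {
      "name": "*b7119 x3387*"
    }
  }
}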

Regexp doesn't seem to be a possible solution :(

I tried to use regexp search:

{'regexp': {'name': '.*b7119 x3387.*'}}

But even that did not return the document.

I'm truly at a loss...

Upvotes: 2

Views: 68

Answers (2)

Pierre-Nicolas Mougel

Reputation: 2279

In case you are not already aware, regexp queries with leading .* are computationally expensive. A more Elasticsearch-idiomatic solution would be to use analyzers to handle your problem.

You can index the field with the whitespace stripped out and use an ngram analyzer to split your text into sub-tokens, as sketched below. This solution should be much faster, but it will require much more disk space to store all the n-grams.
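For example, one way to set this up (the analyzer name, gram sizes, and type name here are just placeholders to adjust for your data) is a custom analyzer that removes whitespace with a char filter and then emits n-grams:

{
  "settings": {
    "analysis": {
      "char_filter": {
        "strip_spaces": {
          "type": "pattern_replace",
          "pattern": "\\s+",
          "replacement": ""
        }
      },
      "tokenizer": {
        "trigrams": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 3
        }
      },
      "analyzer": {
        "ngram_no_spaces": {
          "type": "custom",
          "char_filter": ["strip_spaces"],
          "tokenizer": "trigrams",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "type": {
      "properties": {
        "name": {
          "type": "string",
          "analyzer": "ngram_no_spaces"
        }
      }
    }
  }
}

At search time you would typically run the query text through the same analyzer (or use a phrase-style match), so that b7119 x3387 is reduced to the same space-free n-grams as the indexed value.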

Upvotes: 1

A l w a y s S u n n y

Reputation: 38542

First of all, for regexp to work you need to map your name field as not_analyzed, because Elasticsearch applies the regexp to the terms produced by the tokenizer for that field, not to the original text of the field:

"type": {
   "properties": {
      "name": {
         "type": "string",
         "index": "not_analyzed",
         "store": true
      }
   }
}
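With that mapping applied (and the document reindexed), a regexp query along the lines of the one from the question should then be matched against the whole, untokenized value; a sketch:

{
  "query": {
    "regexp": {
      "name": ".*b7119 x3387.*"
    }
  }
}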

Upvotes: 2
