DevB2F

Reputation: 5095

no results when using whitespace in regex query

When I make this query:

curl -X GET "localhost:9200/_search" -H 'Content-Type: application/json' -d'
{
    "query": {
        "regexp":{
            "main_text": ".*word r.*"
        }
    }
}
'

I get no results. But when I query:

curl -X GET "localhost:9200/_search" -H 'Content-Type: application/json' -d'
{
    "query": {
        "regexp":{
            "main_text": ".*word.*"
        }
    }
}
'

I get results with word (including results with "word r..."). I am using Elasticsearch 6.2.2. Any idea what is going on?

Upvotes: 2

Views: 316

Answers (2)

Kamal Kunjapur

Reputation: 8860

Let's say you have the below sentence

word raincoat bword wordcd

If the field main_text is of type text and uses the default analyzer (the Standard Analyzer), the sentence is broken into the following tokens:

word
raincoat
bword
wordcd

(Yup, no spaces.)

These tokens are what is actually stored in the inverted index, and when you query using match or even regexp, the query is matched against these tokens.

Note that the inverted index does not store the sentence as is, e.g. "word raincoat", nor does it store "word " (notice the space).

Now, when you use the regex .*word.*, you get documents containing word, bword and wordcd, because that's what your inverted index has.

When you use the regex .*word r.*, however, you get no results, since the inverted index doesn't store "word raincoat" as a single term.
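To make the point above concrete, here is a rough simulation in plain Python. It does not talk to Elasticsearch: standard_like_tokens is a hypothetical, simplified stand-in for the Standard Analyzer (lowercase, split on non-word characters), and re.fullmatch stands in for the fact that a regexp query has to match a whole term.

```python
import re

def standard_like_tokens(text):
    # Assumption: a simplified stand-in for the Standard Analyzer.
    # It lowercases the text and splits on anything that is not a word character.
    return re.findall(r"\w+", text.lower())

def matching_terms(pattern, terms):
    # A regexp query must match an entire term, hence fullmatch.
    return [t for t in terms if re.fullmatch(pattern, t)]

sentence = "word raincoat bword wordcd"
terms = standard_like_tokens(sentence)

print(terms)                                 # ['word', 'raincoat', 'bword', 'wordcd']
print(matching_terms(r".*word.*", terms))    # ['word', 'bword', 'wordcd']
print(matching_terms(r".*word r.*", terms))  # []
```

No term contains a space, so a pattern that requires one can never match.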

What you can do is make main_text a field of type keyword. The keyword datatype doesn't go through the analysis phase, so the entire value is stored as is in the inverted index. Your regex .*word r.* would then work as expected.
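The same idea as a small Python sketch, again only a simulation and not an Elasticsearch call: a keyword field is modelled here as a single term holding the whole value, which is an assumption made for illustration.

```python
import re

value = "word raincoat bword wordcd"

# Assumption: a keyword field is modelled as one term containing the
# untouched value, spaces included.
keyword_terms = [value]

pattern = r".*word r.*"
hits = [t for t in keyword_terms if re.fullmatch(pattern, t)]

print(hits)  # ['word raincoat bword wordcd']
```

Because the space survives inside the single term, the pattern now has something to match.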

You always search the inverted index, so you only get what the inverted index stores.

If you need both partial search and exact search, I'd suggest making use of multi-fields for main_text (or whichever field you intend to use).
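A minimal sketch of what such a multi-field setup could look like, written as Python dicts that serialize to the JSON request bodies. The sub-field name keyword and the mapping type _doc are assumptions (Elasticsearch 6.x still expects a mapping type name); adjust them to your index.

```python
import json

# Hypothetical mapping body: main_text is analyzed as text, and the
# sub-field main_text.keyword keeps the whole value as a single term.
mapping = {
    "mappings": {
        "_doc": {  # assumed mapping type name for a 6.x index
            "properties": {
                "main_text": {
                    "type": "text",
                    "fields": {
                        "keyword": {"type": "keyword"}
                    }
                }
            }
        }
    }
}

# Regexp query against the un-analyzed sub-field, where spaces survive.
regexp_query = {"query": {"regexp": {"main_text.keyword": ".*word r.*"}}}

# Ordinary full-text query against the analyzed field.
match_query = {"query": {"match": {"main_text": "word"}}}

print(json.dumps(mapping, indent=2))
print(json.dumps(regexp_query))
print(json.dumps(match_query))
```

The printed bodies can be passed to curl with -d the same way as in the question. Keep in mind that a keyword field is not lowercased, so the regex is case-sensitive against the original value.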

Hope this helps!

Upvotes: 1

YonatanBM

Reputation: 46

This is because regexp is a term-level query and not a full-text query. You are probably using a tokenizer that splits on whitespace, so you will never find a token containing whitespace.

Upvotes: 0
