pmakholm

Reputation: 1568

How to filter tokens based on a regex in ElasticSearch

For an Elasticsearch query we want to handle words (i.e. tokens consisting only of letters) and non-words differently. To do this we try to define two analyzers, one returning only the words and the other only the non-words.

For example, we have documents describing products for a hardware store:

{
    "name": "Torx drive T9",
    "category": "screws",
    "size": 2.5
}

The user would then search for "Torx T9" and expect to find this document. Searching for 'T9' alone would be too generic and return too many irrelevant products, so we only want to match the 'T9' term if 'Torx' also matched.

We tried to create a query like this:

{
    "query": {
        "bool": {
            "must": {
                "match": {
                    "name": {
                        "query": "Torx T9",
                        "analyzer": "words"
                    }
                }
            },
            "should": {
                "match": {
                    "name": {
                        "query": "Torx T9",
                        "analyzer": "nonwords"
                    }
                }
            }
        }
    }
}

The idea is that it would be simple to create token filters to do this, for example something like:

"settings": {
  "analysis": {
     "filter": {
        "words": {
           "type": "pattern",
           "pattern": "\\A\\p{L}*\\Z",
        },
        "nonwords": {
            "type": "pattern",
            "pattern": "\\P{L}",
        }
    }
}

But there doesn't seem to be a token filter that simply keeps or drops tokens based on a pattern. Instead we (ab)use the pattern_replace filter:

"settings": {
  "analysis": {
     "filter": {
        "words": {
           "type": "pattern_replace",
           "pattern": "\\A((?=.*\\P{L}).*)",
           "replacement": ""
        },
        "nonwords": {
            "type": "pattern_replace",
            "pattern": "\\A((?!.*\\P{L}).*)",
            "replacement": ""
        },
        "nonempty": {
            "type": "length",
            "min":1
        }
    }
}

This replaces each unwanted token with an empty token, which the nonempty filter then removes. This seems to work, but the required patterns are rather obscure.
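For reference, the analyzers referenced in the query above would then chain these filters together. The analyzer definitions are not shown in the question, so the exact wiring below is an assumption; a minimal sketch using a standard tokenizer plus lowercasing (analyzer and filter names live in separate namespaces, so they can coincide):

"settings": {
  "analysis": {
    "analyzer": {
      "words": {
        "type": "custom",
        "tokenizer": "standard",
        "filter": ["lowercase", "words", "nonempty"]
      },
      "nonwords": {
        "type": "custom",
        "tokenizer": "standard",
        "filter": ["lowercase", "nonwords", "nonempty"]
      }
    }
  }
}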

Is there a better way to express this?

Upvotes: 4

Views: 2202

Answers (1)

Vijay R.

Reputation: 482

You can try a query_string query with default_operator set to "AND" for your requirement.

For example, consider indexing the two strings "Torx drive T9" and "Square drive T9". If you use a whitespace-based analyzer for indexing, the strings will be analyzed into the following tokens:

First document: torx, drive, t9.
Second document: square, drive, t9.
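
You can verify this tokenization with the _analyze API. The index name my_index is illustrative, and the JSON body form shown here depends on your Elasticsearch version (older versions take query parameters instead):

GET /my_index/_analyze
{
  "analyzer": "whitespace",
  "text": "Torx drive T9"
}

This returns the tokens torx, drive and t9, since the pattern analyzer lowercases its input by default.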

A query_string query with the default operator set to AND will then produce the expected result.

Sample Mapping

{
  "settings": {
    "analysis": {
      "analyzer": {
        "whitespace": {
          "type": "pattern",
          "pattern": "\\s+"
        }
      }
    }
  },
  "mappings": {
    "my_type": {
      "properties": {
        "name": {
          "type": "string",
          "analyzer": "whitespace"
        }
      }
    }
  }
}

Sample Query

{
  "query": {
    "query_string": {
      "default_field": "name",
      "query": "Torx T9",
      "default_operator": "AND"
    }
  }
}

This query will yield a result only when both torx and t9 are present in the document.
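
As a quick end-to-end check, index the two example documents (the index and document IDs are illustrative, and my_type matches the mapping above):

PUT /my_index/my_type/1
{ "name": "Torx drive T9" }

PUT /my_index/my_type/2
{ "name": "Square drive T9" }

The sample query then returns only document 1: document 2 contains t9 but not torx, so the AND operator rejects it.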

Upvotes: 1
