pmakholm

Reputation: 1568

How to filter tokens based on a regex in ElasticSearch

For an Elasticsearch query we want to handle words (i.e. tokens consisting only of letters) and non-words differently. To do this we try to define two analyzers, one returning only the words and the other only the non-words.

For example, we have documents describing products for a hardware store:

{
    "name": "Torx drive T9",
    "category": "screws",
    "size": 2.5
}

The user would then search for "Torx T9" and expect to find this document. Searching for 'T9' alone would be too generic and return too many irrelevant products, so we only want to match the 'T9' term if 'Torx' also matched.

We tried to create a query like this:

{
    "query": {
        "bool": {
            "must": {
                "match": {
                    "name": {
                        "query": "Torx T9",
                        "analyzer": "words"
                    }
                }
            },
            "should": {
                "match": {
                    "name": {
                        "query": "Torx T9",
                        "analyzer": "nonwords"
                    }
                }
            }
        }
    }
}

The idea is that it would be simple to create token filters to do this, for example something like:

"settings": {
  "analysis": {
     "filter": {
        "words": {
           "type": "pattern",
           "pattern": "\\A\\p{L}*\\Z",
        },
        "nonwords": {
            "type": "pattern",
            "pattern": "\\P{L}",
        }
    }
}

But there doesn't seem to be a token filter that simply keeps or drops tokens based on a pattern. Instead we (ab)use the pattern_replace filter:

"settings": {
  "analysis": {
     "filter": {
        "words": {
           "type": "pattern_replace",
           "pattern": "\\A((?=.*\\P{L}).*)",
           "replacement": ""
        },
        "nonwords": {
            "type": "pattern_replace",
            "pattern": "\\A((?!.*\\P{L}).*)",
            "replacement": ""
        },
        "nonempty": {
            "type": "length",
            "min":1
        }
    }
}

This replaces each unwanted token with an empty token, which the nonempty filter then removes. This seems to work, but the required patterns are rather obscure.
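For reference, the analyzers referenced in the query above would then chain these filters together. The analyzer definitions are not shown in the question, so the exact wiring below is an assumption; a minimal sketch using a standard tokenizer plus lowercasing (analyzer and filter names live in separate namespaces, so they can coincide):

"settings": {
  "analysis": {
    "analyzer": {
      "words": {
        "type": "custom",
        "tokenizer": "standard",
        "filter": ["lowercase", "words", "nonempty"]
      },
      "nonwords": {
        "type": "custom",
        "tokenizer": "standard",
        "filter": ["lowercase", "nonwords", "nonempty"]
      }
    }
  }
}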

Is there a better way to express this?

Upvotes: 4

Views: 2202

Answers (1)

Vijay R.

Reputation: 482

You can try a query_string query with default_operator set to "AND" for your requirement.

For example, consider indexing the two strings "Torx drive T9" and "Square drive T9". If you use a whitespace-based analyzer for indexing, the strings will be analyzed into the following tokens:

First document: torx, drive, t9.
Second document: square, drive, t9.
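
You can verify this tokenization with the _analyze API. The index name my_index is illustrative, and the JSON body form shown here depends on your Elasticsearch version (older versions take query parameters instead):

GET /my_index/_analyze
{
  "analyzer": "whitespace",
  "text": "Torx drive T9"
}

This returns the tokens torx, drive and t9, since the pattern analyzer lowercases its input by default.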

A query_string query with the default operator set to AND will then produce the expected result.

Sample Mapping

{
  "settings": {
    "analysis": {
      "analyzer": {
        "whitespace": {
          "type": "pattern",
          "pattern": "\\s+"
        }
      }
    }
  },
  "mappings": {
    "my_type": {
      "properties": {
        "name": {
          "type": "string",
          "analyzer": "whitespace"
        }
      }
    }
  }
}

Sample Query

{
  "query": {
    "query_string": {
      "default_field": "name",
      "query": "Torx T9",
      "default_operator": "AND"
    }
  }
}

This query will yield a result only when both torx and t9 are present in the document.
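
As a quick end-to-end check, index the two example documents (the index and document IDs are illustrative, and my_type matches the mapping above):

PUT /my_index/my_type/1
{ "name": "Torx drive T9" }

PUT /my_index/my_type/2
{ "name": "Square drive T9" }

The sample query then returns only document 1: document 2 contains t9 but not torx, so the AND operator rejects it.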

Upvotes: 1
