Elasticsearch: match a pattern using regexp with a specific order

Question

I would like to know if it is possible to use regexp in Elasticsearch to match a pattern with different strings in a specific order.

For example, in the string still_adv be_aux pick_verb up_adp the_det bus_noun, I want to match the combination of words tagged ADV+AUX+VERB, which here is still_adv be_aux pick_verb.

I am using this regexp:

{
  "query_string": {
    "fields": ["sentences_features.tagger.annotation"],
    "query": "*(.*_adv) (be_aux) (.*_verb)*"
  }
}

However, this regexp is not working and matches each word separately.

Joe - Check out my books · Accepted Answer

You could wrap the whole group in parentheses:

*((.*_adv) (be_aux) (.*_verb))*

But for future reference, it'd be better to split these annotation tags into more easily searchable key-value pairs like:

[ {word_type: 'adv', text: 'still', position: 0 }, {...}, ... ]

It's more work at the beginning but will come in handy later.
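
As a rough sketch of that preprocessing step, the annotation string could be split before indexing. The function name split_annotation is made up for illustration, and the code assumes the first underscore separates the word from its tag:

```python
def split_annotation(annotation):
    """Turn 'still_adv be_aux' into a list of {word_type, text, position} dicts."""
    tokens = []
    for position, token in enumerate(annotation.split()):
        # Assumption: the first underscore separates the word from its tag,
        # so 'driver_NOUN_b1' yields text='driver', word_type='NOUN_b1'.
        text, _, word_type = token.partition("_")
        tokens.append({"word_type": word_type, "text": text, "position": position})
    return tokens


tokens = split_annotation("still_adv be_aux pick_verb up_adp the_det bus_noun")
for entry in tokens:
    print(entry)
```

If you index an array like this, it would most likely need a nested mapping, since plain object arrays are flattened and word_type and position would no longer be tied to the same token.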

EDIT

After setting up an index with a keyword field mapping

PUT myind
{
  "mappings": {
    "properties": {
      "annot": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword"
          }
        }
      }
    }
  }
}

and indexing a few docs

// valid
POST myind/_doc
{
  "annot": "and_CCONJ other_ADJ driver_NOUN_b1 professional_ADJ"
}

// valid
POST myind/_doc
{
  "annot": "xyz other_ADJ driver_NOUN_b1 xyz"
}

// invalid
POST myind/_doc
{
  "annot": "and_CCONJ other_ADJ professional_ADJ driver_NOUN_b1"
}

we can use a regexp query on the .keyword subfield like so:

GET myind/_search
{
  "query": {
    "regexp": {
      "annot.keyword": "(.* )?other_ADJ [a-zA-Z*]*_NOUN_b1( .*)?"
    }
  }
}

and if you don't care what's in between the two tokens, you can use

(.* )?other_ADJ( .*)?[a-zA-Z*]*_NOUN_b1( .*)?
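
To see how the two patterns treat the three sample docs, the check can be reproduced offline. This is only a sketch: it uses Python's re.fullmatch as a stand-in for Lucene's regexp engine, on the assumption that the two behave the same for patterns this simple. fullmatch mirrors the fact that a regexp query has to match the entire keyword value, which is why the patterns carry the optional (.* )? and ( .*)? padding. The variable and function names are made up for illustration.

```python
import re

# The two patterns from the answer, copied unchanged.
ADJACENT = r"(.* )?other_ADJ [a-zA-Z*]*_NOUN_b1( .*)?"
ANYWHERE_AFTER = r"(.* )?other_ADJ( .*)?[a-zA-Z*]*_NOUN_b1( .*)?"

docs = [
    "and_CCONJ other_ADJ driver_NOUN_b1 professional_ADJ",  # valid
    "xyz other_ADJ driver_NOUN_b1 xyz",                     # valid
    "and_CCONJ other_ADJ professional_ADJ driver_NOUN_b1",  # invalid
]


def matches(pattern, value):
    # fullmatch approximates the implicit anchoring of a regexp query:
    # the pattern must cover the whole value, not just a substring.
    return re.fullmatch(pattern, value) is not None


adjacent = [matches(ADJACENT, doc) for doc in docs]
anywhere_after = [matches(ANYWHERE_AFTER, doc) for doc in docs]

for doc, a, b in zip(docs, adjacent, anywhere_after):
    print(f"{doc!r}: adjacent={a}, anywhere_after={b}")
```

The first pattern rejects the third doc because another token sits between other_ADJ and the noun; the looser pattern accepts all three.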

For HTML tag stripping check this answer.
