Reputation: 437
I would like to know if it is possible to use regexp in Elasticsearch to match a pattern with different strings in a specific order.
For example for the string: still_adv be_aux pick_verb up_adp the_det bus_noun
I want to match the combination of words tagged ADV+AUX+VERB, so here it would be still_adv be_aux pick_verb.
I will use this regexp:
{
"query_string":
{
"fields": ["sentences_features.tagger.annotation"],
"query": "*(.*_adv) (be_aux) (.*_verb)*"
}
}
However, this regexp is not working and matches each word separately.
Upvotes: 0
Views: 538
Reputation: 16933
You could wrap the whole group in parentheses:
*((.*_adv) (be_aux) (.*_verb))*
But for future reference, it'd be better to split these annotation tags into more easily searchable key-value pairs like:
[ {word_type: 'adv', text: 'still', position: 0 }, {...}, ... ]
It's more work at the beginning but will come in handy later.
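As a rough sketch of that preprocessing step, the tagged string could be split into records before indexing. The function name split_annotations is made up for illustration, and it assumes every token has the form word_tag with the tag after the last underscore (as in the question's example):

```python
from typing import Dict, List, Optional, Union

Record = Dict[str, Union[str, int, None]]


def split_annotations(annotation: str) -> List[Record]:
    """Split whitespace-separated 'word_tag' tokens into key-value records.

    Assumption: the tag is whatever follows the LAST underscore of a token.
    Tokens without an underscore get word_type None.
    """
    records: List[Record] = []
    for position, token in enumerate(annotation.split()):
        text, sep, tag = token.rpartition("_")
        word_type: Optional[str] = tag if sep else None
        records.append(
            {
                "word_type": word_type,
                "text": text if sep else token,
                "position": position,
            }
        )
    return records


records = split_annotations("still_adv be_aux pick_verb up_adp the_det bus_noun")
print(records[0])  # {'word_type': 'adv', 'text': 'still', 'position': 0}
```

Note that a multi-part tag such as driver_NOUN_b1 would need a different splitting rule than the one assumed here.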
EDIT
After setting up an index with a keyword field mapping
PUT myind
{
"mappings": {
"properties": {
"annot": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
}
}
}
}
}
}
and syncing a few docs
// valid
POST myind/_doc
{
"annot": "and_CCONJ other_ADJ driver_NOUN_b1 professional_ADJ"
}
// valid
POST myind/_doc
{
"annot": "xyz other_ADJ driver_NOUN_b1 xyz"
}
// invalid
POST myind/_doc
{
"annot": "and_CCONJ other_ADJ professional_ADJ driver_NOUN_b1"
}
we can use a regexp query on the .keyword field like so:
GET myind/_search
{
"query": {
"regexp": {
"annot.keyword": "(.* )?other_ADJ [a-zA-Z*]*_NOUN_b1( .*)?"
}
}
}
and if you don't care what's in between the two tokens, you can use
(.* )?other_ADJ( .*)?[a-zA-Z*]*_NOUN_b1( .*)?
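To sanity-check both patterns without a cluster, here is a small Python sketch run against the three sample docs. It is only an approximation: Python's re module is not Lucene's regexp engine, but these particular patterns use syntax both share, and re.fullmatch mimics the fact that a regexp query has to match the entire keyword value. The variable and function names are made up for illustration.

```python
import re

# The three sample documents from above.
docs = [
    "and_CCONJ other_ADJ driver_NOUN_b1 professional_ADJ",  # valid
    "xyz other_ADJ driver_NOUN_b1 xyz",                      # valid
    "and_CCONJ other_ADJ professional_ADJ driver_NOUN_b1",  # invalid
]

# Pattern from the regexp query: the noun must directly follow other_ADJ.
ADJACENT = r"(.* )?other_ADJ [a-zA-Z*]*_NOUN_b1( .*)?"
# Looser pattern: anything may sit between the two tokens.
ANYTHING_BETWEEN = r"(.* )?other_ADJ( .*)?[a-zA-Z*]*_NOUN_b1( .*)?"


def matches(pattern, values):
    """Return one bool per value: does the pattern match the WHOLE string?"""
    return [re.fullmatch(pattern, value) is not None for value in values]


adjacent_hits = matches(ADJACENT, docs)
loose_hits = matches(ANYTHING_BETWEEN, docs)

print(adjacent_hits)  # [True, True, False]
print(loose_hits)     # [True, True, True]
```

So the strict pattern rejects the third doc, while the looser one accepts all three.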
For HTML tag stripping check this answer.
Upvotes: 1